ABSTRACT

Knowledge-Based Transformational Synthesis
of  Efficient  Structures  for  Concurrent  Computation 

by  RICHARD  M.  KING 


The object of our research is the codification of programming knowledge for the
synthesis of concurrent programs. This is important because concurrency is a way of securing
better performance on amenable problems than is available on non-concurrent computers.
We divide this knowledge into two sections: knowledge for the synthesis of arrays of
processors that could be connected in a geometrically regular manner (crystalline concurrency),
and knowledge for the synthesis of tree structures (tree concurrency).


We divide synthesis of crystalline concurrency, in turn, into several subsections:
synthesis of declarations of multiple processors and the wires implied by the dependencies
among the values they contain, reduction of this wire network to a smaller wire network,
creation of subnetworks to replace an overly-broad fanout network, virtualization, which is
the creation of additional array elements and processors to reflect the internal
enumerations that comprise the computation of a datum, and aggregation, which is the merging of
several processors into one.


We  also  divide  tree  concurrency  synthesis.  Our  primary  technique  is  divide  and  conquer, 
but  to  make  this  technique  effective  we  must  take  another  view  of  the  specification.  We 
respecify  a  given  requirement,  that  of  computing  a  new  array  whose  values  are  pointwise 
computable  as  a  function  of  an  existing  array  and  an  index,  as  a  requirement  to  compute  a 
functional  object  whose  side  effect  is  to  satisfy  the  original  specification,  together  with  the 
requirement  that  said  object  be  called  with  the  proper  arguments.  We  call  the  computed 
functional object a closure.

We use a transformational approach. The transformational system has rules, each of
which contains two predicates: an antecedent and a consequent. If the antecedent of a rule
is true of a given object, the rule applies and the object is modified to make the consequent
true.

We demonstrate these techniques’ ability to synthesize one or more solutions to each of
several classical problems from the literature. These solutions are topological descriptions
of arrays of computing elements (“processors”). The resulting elements’ complexities range
from a couple of gates to something comparable to a microprocessor. We do not attempt
to actually lay out the processors and interconnections.

Chapter 1


Introduction  and  Approach 


Computation  power  can  be  delivered  in  several  denominations,  ranging  from  chips  that 
can  execute  a  few  hundred  eight-bit  instructions  per  second  up  to  large-scale  computers 
that  can  approach  a  hundred  million  64-bit  instructions  per  second.  There  is  no  reason 
to  suppose  that  the  larger  computers  deliver  more  computation  power  per  dollar  than 
the  smaller  ones.  At  present,  cheap  computer  power  seems  to  come  in  small  packages. 
We  can  informally  argue  that  a  given  problem  requires  a  certain  number  of  gates  for  a 
solution  in  a  given  amount  of  time,  whether  these  gates  are  contained  in  a  single  large 
package  or  several  small  ones.  Additionally,  in  the  large  processor  there  are  constraints 
that  force  use  of  extra  logic  to  keep  track  of  some  of  the  work  that  the  machine  is  trying  to 
overlap  with  other  related  work.  Some  of  those  gates  may,  at  times,  lie  fallow,  as  may  some 
computational  gates  in  the  large  machine  if  the  mix  of  tasks  is  instantaneously  different 
from  that  which  the  designer  assumed  during  construction.  Additionally,  we  argue  that 
the  electronics  industry  produces  a  much  greater  volume  of  computation  in  small  chunks 
than of computation in large chunks, and therefore enjoys the economies of scale.

We  could  accept  slow  but  cheap  computation,  allowing  us  to  afford  many  computers. 
There  are  situations  where  this  is  the  proper  course  of  action.  An  arcade  owner  does  better 
to  have  a  small  computer  in  each  game  of  his  arcade  instead  of  a  single  large  processor  to 
power  all  of  the  games,  even  ignoring  the  reliability  and  engineering  problems  of  the  latter 
approach. Companies are beginning to supply each of their employees with a desk-top
computer  rather  than  with  a  terminal  into  a  large  computing  system. 

Slow computation is not, however, acceptable in all cases. There are classes of situations
in  which  a  certain  minimum  amount  of  computer  power  must  be  provided.  These  situations 
range  from  fast  real-time  systems  such  as  avionics,  through  situations  such  as  weather 
forecasting  where  we  might  attempt  to  model  48  hours  of  atmospheric  behavior  in  24  hours, 
where  more  computation  power  would  allow  a  finer-grained  model  of  the  atmosphere,  to 
aerodynamic  modeling  where  an  increase  in  computer  power  improves  the  situation  from 
the  point  where  it’s  better  to  experiment  on  actual  hardware  to  the  point  where  it  is  better 
to  experiment  on  computer  models. 

1.1  The  Need  for  a  VLSI  Design  Assistant 

The  number  of  devices  that  can  fit  on  an  integrated  circuit  continues  to  increase.  It  is 
expected  [MeC80]  that  there  will  be  at  least  one  more  factor  of  ten  reduction  in  the  feature 
size  of  integrated  circuits  before  physical  limits  are  reached,  giving  a  hundredfold  increase 
in  the  number  of  devices  that  can  be  integrated  on  a  chip  of  a  given  size.  Additionally,  it 
would not be unreasonable to expect some increase in the maximum size of chips.

At  present  it  is  practical  for  a  designer  to  specify  all  of  the  functional  blocks  of  his 
design.  Current  technology  allows  for  a  number  of  gates  on  a  chip  approaching  a  million, 
but computer aided design tools can allow him to deal with the complexity that this allows
by specifying circuit information as logic diagrams rather than as circuit masks, and in
some new systems a more convenient form than logic diagrams is used (for example in
MACPITTS  [SSC82]  the  design  is  specified  in  a  LISP-like  language).  Still,  the  entire 
functional  design  comes  from  the  designer. 

In  the  future  it  will  be  possible  to  squeeze  a  hundred  times  as  much  function  on  a 
chip as is now possible. Good ways must be found to exploit this capability and to create
chips  which  make  good  use  of  a  hundred  times  as  many  gates  as  current  chips  have.  One 
technique  would  be  to  have  a  number  of  functional  blocks  on  a  chip  comparable  to  the 
number  in  current  designs.  Most  blocks  will  have  to  be  larger  than  those  of  current  chips 
in  order  to  put  as  much  functionality  on  a  chip  as  will  be  possible  without  tremendously 
increasing  the  number  of  blocks. 

One  method  for  allowing  these  larger  blocks  is  to  have  a  library  of  large  blocks  available 
to  the  designer.  This  would  be  undesirable  because  it  is  plausible  that  the  number  of  blocks 
desired  by  various  designers  is  a  rapidly  growing  function  of  their  size.  For  example,  since 
the  sizes  of  functional  blocks  would  be  at  least  comparable  to  the  size  of  current  chips, 
and  since  one  of  the  major  constraints  on  modern  chip  designs  (pin  limitations)  wouldn’t 
apply,  it  would  seem  reasonable  to  suppose  that  there  should  be  at  least  as  many  functional 
blocks  available  as  there  are  chip  types  now.  This  would  be  unacceptable. 

Another  approach  is  to  use  hierarchical  design  methods.  In  effect  each  designer  creates 
an ad hoc library. This approach has the problems inherent in private subroutine libraries,
including  difficulties  of  sharing  effort  in  a  large  project. 

We  have  a  fairly  close  analogy  between  the  future  situation  regarding  the  ability  of  VLSI 
fabricators  to  make  large  chips  and  the  current  ability  of  software  designers  to  “fabricate” 
software.  Language  designers  can  either  provide  more  primitive  operations  and  therefore 
hope  to  cover  the  needs  of  programmers,  or  they  can  provide  the  programmer  with  the 
option  of  building  his  own  large  building  blocks  out  of  smaller  ones.  These  are  called 
subroutines.  Designers  of  languages  like  COBOL,  APL  and  SNOBOL  attempted  to  provide 
numerous  primitive  operations  and  precisely  the  correct  ones  for  certain  problem  domains 
(although  they  felt  it  necessary  to  provide  the  ability  to  define  subroutines  as  well) .  Other 
languages  such  as  LISP  provide  few  primitive  operations  but  are  intended  to  facilitate 
creation  of  subroutines.  Limits  to  this  approach  have  been  recognized.  See,  for  example, 
[Sch80] in which it is pointed out that modularity facilitates software construction at
the expense of efficiency, and [DoD83] in which it is pointed out that currently proposed
software tasks not only cannot be done efficiently but cannot even be done reliably without
using  something  more  advanced  than  modular  techniques. 

At our laboratory [GCP81] and others [Hor81] work is proceeding on knowledge based
software assistants, which are systems that allow their user to describe a desired system
behavior in a specification language. This thesis describes the beginning of a knowledge
based  VLSI  designer  assistant. 

1.2  Goals 

Several steps must be taken to exploit the cost-effectiveness of small denominations of
computation: 

•  The  processors  need  to  communicate.  If  a  system  has  a  large  number  of  small 
processors  and  no  wires  between  them  the  only  thing  it  can  do  is  solve  a  large  number 
of  small  problems  simultaneously.  This  is  not  always  an  accurate  reflection  of  what 
people  want.  The  nature  of  the  problems  that  the  processors  will  be  able  to  solve 
quickly  depends  critically  on  the  way  they  are  interconnected.  Unfortunately,  so 
does  the  cost  of  the  multiprocessor  system,  and  it  turns  out  that  the  most  versatile 
topologies  are  those  that  cost  the  most  to  wire. 

• The processors need to be scheduled. Somebody must decide what each
processor will do when. One cannot merely take an algorithm that is carefully crafted
to  run  on  a  single  large  processor  and  make  it  run  on  a  multiprocessor  system  with 
little  change.  In  [AMa84]  it  is  shown  that  some  problems  can  be  solved  by  efficient 
single-processor  algorithms  that  have  no  concurrent  analog.  There  are  concurrent 
solutions to the original problem, but the sequential algorithm produces a specific
solution. No algorithm, sequential or concurrent, can solve the problem of finding
the specific solution that the sequential algorithm would have produced faster than
in polynomial time (unless P ⊆ NC where, for problem instances of size n, P is the
set of problems solvable in O(n^t) time (for some constant t) on a single processor
and NC is the set of problems solvable in O(log^t n) time using O(n^j) processors (for
constants t and j). It is believed that P ≠ NC.)

• The processors need to be loaded. Normally some entity outside of the
problem-solving computer presents the data representing a problem instance at a single source,
and  the  computer  is  required  to  deliver  the  results  to  a  single  destination.  The 
program  would  also  normally  be  stored  on  a  single  device  for  economic  reasons.  We 
must  address  the  issues  of  how  instruction  loading,  input  and  output  will  take  place 
across  the  single-stream/multiple-stream  interface. 

We  use  the  approach  of  synthesizing  a  concurrent  version  of  a  specification  expressed 
in  an  extremely  “high  style”.  The  reason  for  this  is  that  specifications  are  turned  into 
programs  by  the  addition  of  specialized  information  (such  as  data  structure  selection),  and 
the  removal  of  this  specialized  information  is  difficult.  This  information  imposes  constraints 
on  the  manner  in  which  the  calculation  is  carried  out,  and  these  constraints  make  it  more 
difficult  to  produce  a  program  optimal  according  to  one  set  of  criteria  from  another  program 
that  was  (previously)  optimized  for  another  set. 

Our  system,  which  we  call  TRANSCONS  (for  the  TRANSformational  CONcurrency 
Synthesizer),  accepts  input/output  specifications  in  a  high  level  language.  It  transforms 
these  into  descriptions  of  parallel  structures. 

We  will  not  concern  ourselves  with  the  placement  of  processing  nodes  or  gates  on  a 
surface  or  in  space,  even  though  the  quality  of  the  placement  can  drastically  affect  efficiency 
by  altering  wire  lengths  and  therefore  path  delays  and  costs.  We  feel  that  our  system  meets 
its need despite this omission for the following reasons:

• the topology could be adequate for our needs

• systems for laying out circuits, given the topologies, exist

• the topologies that TRANSCONS synthesizes have fairly obvious layouts (i.e., the
coordinates of the position of a processor are linear functions of its indices)

One possible use of the output of a TRANSCONS run is to control the operation of a
“universal” parallel computer such as a shuffle exchange or cube-connected cycle system.
Such a use of TRANSCONS output might be made for testing purposes, but the expense
of the universal parallel computers and the O(log n) factor speed loss for simulation of
an n-processor system with direct interconnections as specified by TRANSCONS make it
unlikely  that  this  will  be  a  standard  use  of  this  technology. 

The universal architectures described above all have the unattractive property that some of
the wires must be long, and the total wire length is long. They also have some extremely
untidy wiring layouts in any physical implementation (necessarily so: since every surface
that bisects the network must be pierced by a large number of wires, the wiring arrangement
contains few bundles of wires tracing adjacent paths). Wiring is one of the least reliable
parts of a modern digital computer system. In addition, the log n speed factor can be a
serious  matter. 

It  would  therefore  be  desirable  in  some  cases  to  reduce  the  average  length  of  the  wires 
and  increase  the  orderliness  of  the  interconnections,  so  intermodule  connections  can  be 
reduced  to  interboard  interconnections,  which  can  be  reduced  to  printed  circuit  connections 
and  in  turn  to  connections  within  a  chip.  This  reduces  cost  and  increases  reliability. 

The  techniques  of  TRANSCONS  produce  topologies  that  could  easily  be  laid  out  by 
computer as tidy layouts. This is because the various expressions controlling the
interconnections between the processors are restricted to be from Presburger Arithmetic, and they
are simple expressions linear in the processors’ names.


If TRANSCONS produces a crystalline interconnection pattern with higher dimensionality
than an available network, it is still possible to find an assignment of logical processors
to  physical  ones  that  incurs  only  a  moderate  speed  penalty.  For  example,  there  is  a  simple 
mapping  of  a  X  x  ^/n  array  of  processors  onto  a  two  dimensional  array  such  that 
the cost of communication along two of the simulated dimensions is O(1) and that of the
third is The constant factor would perhaps be one third of that of a simulation on
a universal computer, because only one of the three directions would have this problem.
Since lg n < only for n > 64000, the universal computer would only excel on rather large
problem  instances. 


It should probably be pointed out that circular reasoning was used in the above
argument. We selected a simple form of enumerated expression (for other reasons) and observed
that the parallel structures that result can be simply laid out in Euclidean spaces of
various dimension. This may make it necessary to provide mechanisms to include “canned”
subnetworks (e.g., for sorting), but once these are provided the system will be reasonably
general and will retain the property that it generates easy-to-lay-out networks.


We explore a series of problems from the literature of computer science that are known
to have good parallel solutions. We used classical problems from the literature of computer
science, rather than, in any sense, selecting a “random cross section” of problems (whatever
that would mean), because it is fairly well known what is possible in terms of parallel
structures for these problems and we therefore had targets for the tools we were trying
to build, as well as a yardstick against which to measure the results. We conjecture that
most real problems that take a lot of computer resources reduce to a series of classical
problems. For example, in [Knu69] and [Knu73] respectively the point is made that
array manipulation and sorting are major consumers of computing resources as they are
used today.


1.2.1 Previous Work


A technique similar to our virtualization technique is described in [Mir83] and
[MWi84]. Their technique is to duplicate all scalars or array elements that receive
multiple assignments and then to compute the data flow based on stereotyped constructs of a
FORTRAN-like  specification  language.  Their  system  finds  interconnection  nets  that  meet 
certain  linear  algebraic  properties. 

The main difference between the techniques in the above paragraph and our
virtualization is in the form of data dependency allowed. The use of linear algebra for dependency
analysis  allows  application  of  these  techniques  only  when  information  flows  from  an  array 
element  to  another  array  element  whose  coordinates  are  linear  functions  of  the  coordinates 
of  the  first  element.  While  the  prototype  TRANSCONS  uses  linear  algebra  in  place  of  a 
more  general  theorem  prover  and  therefore  shares  this  property,  the  form  of  the  rule  and 
the  compartmentalization  of  the  information  makes  addition  of  new  knowledge  simple. 

One  consequence  of  this  is  that  there  is  no  notion  of  aggregation.  Because  of  the 
finality of the result it cannot be aggregated conveniently, and it is therefore a reasonable
technique only where there is a constant amount of work per processor element already.

Numerous  systems  exist  for  creating  VLSI  layouts  or  VLSI  topological  descriptions 
from  low  level  description  languages.  In  each  case  we  will  only  cite  one  or  two  examples, 
with  no  intent  to  imply  anything  about  those  that  are  chosen  on  the  one  hand  or  left  out 
on  the  other. 

MACPITTS  ([SSC82]),  from  the  MIT  Lincoln  Laboratory,  can  produce  VLSI  layouts 
from a LISP-like language which includes constructs like (SETQ ...) to create a signal,
(+ ...) etc. to specify arithmetic or logical operations on signals, and looping constructs.
MACPITTS determines and places the minimum number of functional modules to
“execute” a given “program”, creates a programmed logic array (PLA) to control these modules,
and  lays  out  the  wires  among  these  parts. 

The Palladio system ([HTF83]) allows a user to interactively create a VLSI topology
by “discussing” with the system what is to be done.

These  VLSI  design  systems  have  as  their  primary  goal  the  avoidance  of  the  electronics 
pitfalls such as capacitance problems or delays on long lines and violations of the “design
rules” of the technology that would make fabrication unreliable.

The  communication  of  actors  or  closures  between  processors  to  model  communication 
of  problem  data  between  processors  has  been  current  since  [AHe77]  and  [The82].  Here 
actors  are  separate  objects  that  do  their  work  by  sending  and  receiving  messages.  Such 
a  transmission  is  an  event.  A  message  can  be  an  actor.  Conceptually  the  actors  have 
independent existence, but of course they must have some physical realization and,
assuming machines capable of processing messages by and for actors contained therein are
called  processors,  the  passing  of  messages  among  actors  in  different  processors  can  model 
communication  among  the  processors.  In  this  work  the  actor  is  an  object  that  entitles  its 
holder  to  perform  some  action  by  invoking  it,  and  it  can  be  passed  from  one  process  to 
another.  The  receiving  process  can  invoke  it,  and  the  work  is  performed  in  an  environment 
derived  from  the  processor  that  created  the  actor.  In  this  thesis  we  will  use  closures,  which 
are  objects  similar  to  actors  that  can  only  be  invoked  once  and  then  cease  to  exist. 

Divide & conquer is a powerful synthesis technique for efficient sequential programs (see
[Smi83a] and [Smi83b]), but it has not been widely used to synthesize tree-structured
collections of communicating processors, even though such an application would seem obvious
because of the correspondence between the division process of divide & conquer and the
branching structure of the desired collection of processors. The reason for this is that the
synthesis process encounters technical problems when one tries to perform such a synthesis
in the obvious manner. The use of closures, similar to actors, mitigates these technical
problems at the expense of requiring a more general theorem prover than is required to
perform divide & conquer syntheses of sequential programs.

1.3  Approach 

TRANSCONS  specializes  in  two  areas.  As  the  first  specialty  it  can  synthesize  crystalline 
networks  of  processors,  in  which  members  of  a  family  of  processors  can  be  described  by 
vectors of indices (integers initially; in principle any ordered set). In such a structure, each
processor is connected to those other processors, each at a fixed distance and direction
from this processor, that exist (if we visualize the network as a group of processors, each
occupying a point with integer coordinates in Euclidean space of appropriate dimension).
As the other specialty it can synthesize balanced binary tree structures in which the internal
nodes  all  run  the  same  procedure. 
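
The connection rule for a crystalline network can be sketched in a few lines of Python.
This is an illustration only, not TRANSCONS output: the function name
crystalline_neighbors, the offset list, and the 3 x 3 array shape are assumptions chosen
for the example. Each processor is named by an integer index vector and is wired to the
processors found at a fixed set of offsets, keeping only those that actually exist.

```python
from itertools import product


def crystalline_neighbors(index, offsets, shape):
    """Return the processors wired to `index`: one per fixed offset, if it exists."""
    neighbors = []
    for offset in offsets:
        candidate = tuple(i + d for i, d in zip(index, offset))
        # A neighbor "exists" only if it lies inside the declared array bounds.
        if all(0 <= c < s for c, s in zip(candidate, shape)):
            neighbors.append(candidate)
    return neighbors


# Hypothetical example: a 3 x 3 array in which every processor sends one step
# "right" and one step "down" (the offsets are the same for every processor).
SHAPE = (3, 3)
OFFSETS = [(0, 1), (1, 0)]
wires = {
    idx: crystalline_neighbors(idx, OFFSETS, SHAPE)
    for idx in product(range(SHAPE[0]), range(SHAPE[1]))
}
```

Because the offsets do not depend on the processor, the wiring is the same everywhere
except at the boundary, where some neighbors do not exist.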

In  all  cases  the  synthesis  process  starts  with  specifications  in  the  V  language  (see 
[Gre81], [GCP81], and [KesSS]). V is a broad-spectrum language based on first order
logic (FOL) but containing locutions ranging from FOL to LISP- or Pascal-like
specifications of individual operations, data structures, and values. We use this language for several
reasons: 

•  The  language  is  a  good  one  for  specifying  rules  used  to  transform  specifications  as 
well  as  the  specifications  to  be  transformed.  It  shares  with  LISP  and  RAPTS  [Pai82] 
the  property  that  programs  in  the  language  are  normally  expressed  as  instances  of 
the  data  structures  such  programs  most  easily  manipulate. 

•  These  specifications  are  similar  to  first  order  logic  expressions.  A  theorem  proving 
capability  is  essential,  and  much  work  has  been  done  on  the  problems  of  automatic 
theorem  proving  in  first  order  logic  expressions. 


•  Use  of  a  broad-spectrum  language  facilitates  a  stepwise  refinement.  If  the  source 
and  target  languages  were  distinct  rather  than  being  parts  of  a  single  language,  the 
creation  of  target  text  from  source  text  would  need  to  be  conceptually  a  single  step. 
An  intermediate  form  that  could  hold  both  source  and  target  locutions  would  have 
to  be  provided  unless  the  process  actually  was  so  simple  that  a  single  examination  of 
any  source  object  was  sufficient. 

We  augment  the  V  language  cited  above  with  several  constructs  designed  to  specify 
interconnected  collections  of  similar  processors. 

1.3.1  Parallel  Structure  Refinement 

We  develop  a  series  of  models  of  the  parallel  computation  process.  In  the  highest  level, 
most  details  are  unspecified;  in  the  intermediate  level,  the  order,  but  not  the  timing,  of 
various communications and computations is described; and finally in the two lowest levels,
the notion of a clock is introduced. In the higher of these two levels, the time at which
various operations can take place is determined algebraically. In the lower level, time
differences between actions are computed. This can be used directly in a VLSI synthesis.
A  computation  in  which  operands  are  available  simultaneously  and  the  result  is  needed 
one  cycle  later  can,  for  example,  be  performed  by  combinatorial  logic  with  a  single  “latch” 
connected  to  the  output. 

The  last  stage  of  the  refinement  is  future  work,  but  we  argue  that  it  will  meld  well 
with  the  rest  of  TRANSCONS,  and  that  TRANSCONS  will  then  be  able  to  transform  first 
order  logic  specifications  into  circuit  descriptions  lacking  only  device  placement  steps  to  be 
complete  VLSI  chip  descriptions. 


The  series  of  models  is  such  that  a  coherent  parallel  structure  results  from  stopping  the 
synthesis process at any level. Some levels cannot be reached by some specifications, and
the lowest levels may contain more detail than is desired. The user can control the extent
of  the  synthesis  process. 

1.3.2  Crystalline  Methods 

Crystalline structure synthesis begins with transformations embodying data flow
analysis and analysis of expressions comprising indices of references to array elements. Each
intermediate datum in an array of the specification is assigned to a processor whose index
corresponds to the index of the datum. The result of this simple analysis is generally a
clumsy structure in which each processor is connected to many other processors, too many
to  be  practical.  Often  there  are  other  weaknesses  in  the  structures.  Additional  techniques 
are  therefore  necessary  to  produce  usable  parallel  structures,  and  TRANSCONS  contains 
rules  that  embody  these. 
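
To see why the first-cut structure is clumsy, consider a small Python sketch. It is a
hedged illustration, not the TRANSCONS analysis itself: we assume a dynamic programming
recurrence in which datum c[i, j] is computed from c[i, k] and c[k+1, j] for i <= k < j,
and the function name naive_wires is hypothetical. One processor is assigned per datum,
and a wire is drawn for every data dependency.

```python
def naive_wires(n):
    """Map each processor (i, j), i <= j, to the processors whose values it reads.

    Assumed recurrence (for illustration only): c[i, j] depends on
    c[i, k] and c[k + 1, j] for every k with i <= k < j.
    """
    wires = {}
    for i in range(n):
        for j in range(i, n):
            sources = []
            for k in range(i, j):
                sources.append((i, k))
                sources.append((k + 1, j))
            wires[(i, j)] = sources
    return wires


wires = naive_wires(6)
# The processor for the whole problem, (0, n - 1), reads 2 * (n - 1) values,
# so the widest fan-in grows with the problem size.
max_fan_in = max(len(sources) for sources in wires.values())
```

With n = 6 the widest processor already has ten incoming wires, and the count grows
linearly in n. This is the kind of network the later techniques are meant to reduce.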

The first and most important of these techniques is communication reduction. In this
technique,  a  rule  seeks  a  set  of  communication  lines  whose  transitive  closure  is  equal  to  (or 
includes)  a  distinguishable  subset  of  a  given  communication  network.  After  replacing  that 
portion  of  the  network  with  the  smaller  set,  we  repeat  the  process. 
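
The test at the heart of communication reduction can be sketched as follows. This is a
minimal Python sketch under stated assumptions: wires are modeled as directed pairs of
processor names, the helper names transitive_closure and can_replace are hypothetical,
and the example network (every processor i wired to every processor j > i) is chosen
only for illustration. The sketch checks reachability alone; it does not model the
relaying of values through the intermediate processors.

```python
def transitive_closure(edges):
    """Return all pairs (a, d) such that d is reachable from a along `edges`."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def can_replace(original, reduced):
    """True if the closure of `reduced` includes every wire of `original`."""
    return set(original) <= transitive_closure(reduced)


n = 6
# Original network: processor i sends directly to every later processor j.
original = {(i, j) for i in range(n) for j in range(i + 1, n)}
# Candidate replacement: nearest-neighbor wires only.
reduced = {(i, i + 1) for i in range(n - 1)}
```

Here can_replace(original, reduced) holds, so 15 wires can be replaced by 5
nearest-neighbor wires, at the price of passing values along the chain.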

A  second  technique  is  aggregation,  or  the  collection  of  many  processors  into  one.  This 
technique has two important uses: reducing the number of processors in a system when
each  has  too  little  work  to  do,  and  gluing  together  simple  networks  to  make  more  complex 
ones. 

A third technique is virtualization, or the increase of the dimensionality of a data
structure by explicating a loop that repeats assignments to an internal register. TRANSCONS
normally uses virtualization together with aggregation, because the former creates
numerous processors with little work for each of them to do, and the latter combines processors.
Every virtualization has an inverse aggregation, but whenever a virtualization creates an
array of processors with more than two dimensions there is more than one aggregation, and
TRANSCONS has the option of choosing a different one from the inverse of the original
virtualization.
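
The interplay of virtualization and aggregation can be illustrated with matrix
multiplication. The Python sketch below is our own illustration under stated
assumptions, not TRANSCONS's rule set: the names virtualize, aggregate, and
matmul_virtualized are hypothetical. The loop that accumulates c[i][j] over k is
made explicit as a third index, giving one virtual processor per (i, j, k);
aggregation then merges virtual processors along a chosen axis.

```python
from collections import defaultdict
from itertools import product


def virtualize(n):
    """One virtual processor per (i, j, k): it holds partial sum k of c[i][j]."""
    return list(product(range(n), repeat=3))


def aggregate(virtual, axis):
    """Merge virtual processors that differ only in coordinate `axis`."""
    groups = defaultdict(list)
    for idx in virtual:
        key = idx[:axis] + idx[axis + 1:]
        groups[key].append(idx)
    return dict(groups)


def matmul_virtualized(a, b):
    """Multiply square matrices using the explicit (i, j, k) partial sums."""
    n = len(a)
    partial = {}
    # product() varies k fastest, so partial sum k - 1 is always ready first.
    for i, j, k in virtualize(n):
        previous = partial[(i, j, k - 1)] if k > 0 else 0
        partial[(i, j, k)] = previous + a[i][k] * b[k][j]
    return [[partial[(i, j, n - 1)] for j in range(n)] for i in range(n)]
```

Aggregating along axis 2 (the k axis) is the inverse of the virtualization and gives
back one processor per c[i][j]. Aggregating along axis 0 or 1 instead gives a
different two-dimensional array of processors, which is the extra freedom the
paragraph above describes.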


The fourth technique is chain creation. When an asymptotically large number of
connections exists between an I/O processor and the working processors, and TRANSCONS
wants  to  reduce  this  number,  it  tries  to  apply  this  technique.  If  the  sets  of  values  used  in 
the  working  processors  can  be  grouped  properly,  then  the  wiring  can  be  rearranged  so  only 
a  few  distinguished  working  processors  are  connected  to  input  processors,  and  the  rest  can 
receive  information  “second  hand”  from  other  working  processors.  Similarly,  for  output, 
information  can  be  collected  from  several  working  processors  connected  in  a  chain  and  sent 
to  the  outside  world  via  a  few  distinguished  processors. 
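
The tradeoff made by chain creation can be sketched in Python. This is an illustrative
model only: the names broadcast_wiring, chain_wiring, io_fanout, and deliver are
hypothetical, the I/O processor is represented by the string "io", and we assume the
simplest case in which every working processor needs the same input value.

```python
def broadcast_wiring(n):
    """Before: the I/O processor is wired directly to all n working processors."""
    return [("io", w) for w in range(n)]


def chain_wiring(n):
    """After: only worker 0 is wired to I/O; each worker forwards to the next."""
    return [("io", 0)] + [(w, w + 1) for w in range(n - 1)]


def io_fanout(wires):
    """Number of wires leaving the I/O processor."""
    return sum(1 for src, _ in wires if src == "io")


def deliver(wires, n):
    """Return the number of steps until all n workers have received the value."""
    have = {"io"}
    steps = 0
    while len(have) < n + 1:
        new = {dst for src, dst in wires if src in have} - have
        if not new:
            raise ValueError("some worker is unreachable")
        have |= new
        steps += 1
    return steps
```

For n = 8 the broadcast wiring has an I/O fanout of 8 and delivers in one step, while
the chain has a fanout of 1 and delivers in 8 steps: fewer connections at the I/O
processor are bought with latency along the chain.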

We  restrict  the  forms  of  the  expressions  in  the  declarations  describing  the  processors 
and  their  interconnections,  because  use  of  a  theorem  prover  is  required  for  all  of  these 
techniques  and  the  restrictions  make  this  much  more  feasible. 

We argue the adequacy of TRANSCONS’s techniques for crystalline synthesis by
showing some syntheses of parallel structures for dynamic programming and two structures for
multiplication  of  matrices. 


1.3.3 Tree Methods


We use divide & conquer as the primary synthesis tool for the creation of tree parallel
structures. This technique has a long history of creating efficient sequential programs from
specifications. While it would appear that the synthesis of a tree structure by divide &
conquer  should  be  immediate  because  of  the  correspondence  of  subproblems  and  subtrees, 
there  are  issues  that  require  resolution. 

We  therefore  introduce  the  notion  of  passing  a  closure,  or  functional  object,  between 
processors. While this is not novel (see, for example, [Hew76]), our use of it is. We are
given a specification of I/O behavior of two arrays (one an input and one an output). We
transform this into two specifications: I/O behavior mapping an input array into a functional
object exhibiting certain behavior given the input array of the original specification,
and a request that this functional object be applied. As we demonstrate, using the combination
of divide & conquer and the computation of well-chosen closures, TRANSCONS
is able to synthesize a variety of tree parallel structures which can solve problems ranging
in complexity from census functions [LiV81] (which require no closures) to parts of a connected
components computation, in which a graph’s adjacency matrix is read in row by
row, and the parallel structure “learns” what the sets of points are such that there exists
a  path  from  any  node  in  a  set  to  any  other  node  in  the  same  set. 

1.3.4  Additional  Techniques 

As part of a demonstration of the power of our techniques, we synthesize three circuits
for the addition of binary numbers. The use of different combinations of techniques
produces  circuits  occupying  three  places  in  a  spectrum  of  speed/cost  tradeoffs. 

The  first  order  logic  specification  for  binary  addition  has  nested  bounded  quantifiers 
arranged  in  such  a  manner  that  the  bounds  of  the  inner  quantifier  depend  on  the  bound 
variable  of  the  outer  one.  Therefore,  the  parallel  structure  synthesized  by  the  previous 
methods  of  this  thesis  has  O(n’)  boolean  values  to  compute  for  addition  of  two  n-bit 
numbers.  To  accomplish  this  in  O(logn)  time  would  require  O(n’)  processors.  We  use 
a series of axioms and theorems relating the max, min, ∧, ∨, ∃ and ∀ operators. An
example of a necessary theorem is $\forall\, l \le x \le u\,[P(x)] \equiv \max_{x \le u}[\lnot P(x)] < l$, which restates
a universally quantified expression bounded by an integer subrange into a maximization.
The  axioms  and  theorems,  the  proofs,  and  their  use  are  described  in  Chapter  7.  The 
process  requires  a  theorem  prover  general  enough  to  accept  the  axioms  and  to  either  prove 
or accept the theorems relating these operators. We achieve a specification in which a
pair of nested quantifiers is replaced by an arithmetic comparison of two bounded max
operations.  We  call  the  entire  process  quantifier  levelling. 
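
The restatement of a bounded universal quantifier as a maximization can be checked by brute force on small ranges. The following Python sketch is ours and is not part of TRANSCONS. It assumes an inclusive reading of the bounds (l ≤ x ≤ u), restricts x to a finite domain whose lower end is the hypothetical parameter `floor`, and takes the maximum of an empty set to be minus infinity.

```python
import itertools
import math


def forall_bounded(P, l, u):
    """Direct reading: P(x) holds for every integer x with l <= x <= u."""
    return all(P(x) for x in range(l, u + 1))


def forall_as_max(P, l, u, floor):
    """Maximization reading: the largest x <= u that falsifies P lies below l.

    `floor` is an assumed lower end of the finite domain that is searched;
    the maximum over an empty set is taken to be minus infinity.
    """
    witnesses = [x for x in range(floor, u + 1) if not P(x)]
    largest = max(witnesses) if witnesses else -math.inf
    return largest < l


def check_identity(n):
    """Compare both readings for every predicate on {0, ..., n-1}."""
    for bits in itertools.product([False, True], repeat=n):
        P = lambda x, bits=bits: bits[x]
        for l in range(n):
            for u in range(l - 1, n):  # u = l - 1 gives an empty range
                if forall_bounded(P, l, u) != forall_as_max(P, l, u, floor=0):
                    return False
    return True
```

Under these assumptions the two readings agree on every predicate over a small domain, which is the property the levelling step relies on.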


1.4  Organization 

The  next  Chapter  (after  this  introduction)  gives  formal  descriptions  of  the  four  levels 
of  synthesis  detail  that  TRANSCONS  will  be  capable  of  when  it  is  complete.  The  third 
Chapter describes the abilities of TRANSCONS that facilitate crystalline synthesis.

Chapter  four  discusses  the  divide-and-conquer  method  for  synthesizing  treelike  parallel 
structures,  explores  some  of  the  problems  that  must  be  solved  to  make  it  work,  introduces 
the  notion  of  a  closure  to  solve  these  problems,  and  introduces  the  language  we  use  to 
describe  resulting  structures.  Chapter  five  gives  several  examples  of  the  synthesis  of  tree 
structures  by  these  methods,  and  discusses  in  detail  methods  for  removing  the  closures, 
which  are  a  necessary  “scaffolding”  for  the  synthesis  process  but  not  intended  for  the  final 
product. 

The  sixth  Chapter  shows  a  case  in  which  use  of  mathematical  identities  makes 
TRANSCONS  more  powerful  than  it  otherwise  would  have  been.  It  lends  credence  to 
our  conjecture  that  our  synthesis  tools  will  turn  out  to  be  more  powerful  than  it  would 
seem from the apparently specialized nature of problems we solve in Chapters three and
six. 


The  seventh  and  last  Chapter  explains  the  significance  of  our  results  and  the  future 
paths  we  expect  this  research  to  take.  A  successful  pursuit  of  this  future  research  will 
enhance  TRANSCONS  to  an  extent  that  it  will  be  able  to  automatically  synthesize  most 
of  the  parallel  structures  that  have  been  created  by  hand,  plus  structures  of  comparable 
difficulty  that  have  not  been  created  yet,  either  because  the  need  for  them  has  not  yet 
arisen or because they are so specialized that the effort has not been deemed worthwhile.

The appendix contains four sections: sample dialogs with a complete TRANSCONS,
formal  proof  of  correctness  of  one  of  the  most  important  rules,  formal  proofs  of  the  identities 
used  in  Chapter  six,  and  a  description  of  the  theorem  proving  requirements  of  the  crystalline 
synthesis  portion  of  TRANSCONS. 


Chapter  2 


Formal  Descriptions 


The target of TRANSCONS synthesis has four levels arranged in a hierarchy. These
range from a high-level description of computation activity to a level of description that
lacks only device placement to be suitable for VLSI implementation. TRANSCONS refines
specifications  from  the  higher  levels  of  this  hierarchy  to  the  lower  by  adding  specialized 
information. 

A higher level differs from a lower level by requiring more capabilities in the implementing
hardware. As an example, the highest level (“multiprogramming”) assumes that the
implementing hardware is able to store indefinite amounts of information and to process
each piece of information using a separate virtual processor (usually called a “process”).
The lowest level requires only that each processing element be able to compute some simple
function of all inputs present at one clock cycle c and to present the answer(s) at its
output(s) at a clock cycle c + t, where t is a constant integer dependent on the processing
element. We provide coherent synthesis levels, rather than merely describing a process in
which the specification becomes more and more refined but in which parts of the specification
may be in intermediate states that are not meant to be used by any entity except
continued TRANSCONS synthesis, for two reasons. The first is that it is not possible to
reduce every specification to the lowest level, and we therefore need coherent intermediate
levels as targets for these specifications. The second is that we may have hardware available
that can meet the requirements of the higher levels, so we may choose to stop even
though  it  would  be  possible  to  proceed. 


We progress from higher to lower levels by adding specializing information to the specifications.
To illustrate the various levels of representation we will use the following
specification:


\[ \forall A \;\, \exists A' \;\, \forall i \in \{1 \ldots n\} : \quad A'_i = \sum_{j<i} A_j \]


The  actual  reduction  operator,  here  shown  as  addition,  is  unimportant;  what  is  important 
is that it be implementable as combinatorial logic in VLSI. We intend to show implementations
that might result from taking this specification through all four levels of TRANSCONS
synthesis.  The  reduction  operator  must  be  implementable  in  VLSI  because  we  intend  to 
display  an  implementation  of  the  specification  which  uses  the  reduction  operation,  among 
other  things,  as  an  atomic  operation. 


2.1  Multiprogramming 


A parallel structure in this level consists of procedures, processors, and pools. A processor
contains one or more processes, each of which contains, in turn, a procedure and a
(possibly  null)  index  variable  binding. 


A processor is described by a processors declaration, whose components are given in
this tree diagram:


PROCESSORS ( index variables ) enumerators for index variables
| HAS array name ( array index variables ) enumerators for same
| HEARS processor family name ( indices ) enumerators
| | (USES array name ( indices ) enumerators)
| TALKS processor family name ( indices ) enumerators
| | (SENDS array name ( indices ) enumerators)
| LINKS processor family name ( indices ) processor family name 2 ( indices ) enumerators
| | (PASSES array name ( indices ) enumerators)

Any  of  the  subclauses  can  have  a  condition  attached  to  it  which  will  specify  that  the 
subclause  only  applies  to  instances  of  the  processor  family  or  enclosing  clause  for  which  the 
condition  is  true.  The  condition  is  restricted  to  Presburger  Arithmetic  expressions  whose 
free  variables  are  variables  that  are  bound  further  outward  in  the  processors  statement, 
or  not  bound  at  all.  The  theorem  prover  will  assume  nothing  (beyond  type  information) 
about an unbound variable, which we will call a superglobal in the following. We use the
phrase (sub)clause instantiation to describe the instance of any clause or subclause with
specific  values  of  the  bound  variables.  A  processors  declaration  is  a  type  that  is  attached 
to  a  name,  which  becomes  the  name  of  a  processor  family. 

There  are  consistency  requirements.  If  processor  A  HEARS  or  LINKS  from  processor 
B,  then  processor  B  must  TALK  (to)  or  LINK  to  processor  A.  If  a  HEARS  clause  has  a 
USES  subclause,  the  HEARd  processor  must  either  PASS  or  SEND  that  value  within 
the  corresponding  LINKS  or  TALKS  clause.  Note  that  this  imposes  a  condition  on  clause 
instantiations,  not  merely  clauses. 
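
The consistency requirements can be illustrated with a small checker over fully instantiated clauses. This is a minimal Python sketch under our own assumptions, not the TRANSCONS representation: the `Processor` record, its dictionary fields, and the `violations` function are hypothetical names, and conditions and enumerators are assumed to have been expanded already.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """One instantiated processor; every field name here is illustrative."""
    name: str
    hears: dict = field(default_factory=dict)  # source -> values USEd
    talks: dict = field(default_factory=dict)  # destination -> values SENt
    links: dict = field(default_factory=dict)  # (source, destination) -> values PASSed


def outgoing(proc, target):
    """Values proc SENDS or PASSES to target, or None if no clause targets it."""
    found = None
    if target in proc.talks:
        found = set(proc.talks[target])
    for (_, dst), passed in proc.links.items():
        if dst == target:
            found = (found or set()) | set(passed)
    return found


def violations(processors):
    """List every HEARS or LINKS(from) clause that lacks a matching partner."""
    by_name = {p.name: p for p in processors}
    problems = []
    for p in processors:
        incoming = list(p.hears.items())
        incoming += [(src, passed) for (src, _), passed in p.links.items()]
        for src, needed in incoming:
            offered = outgoing(by_name[src], p.name) if src in by_name else None
            if offered is None:
                problems.append(f"{src} has no TALKS or LINKS clause to {p.name}")
            elif not set(needed) <= offered:
                missing = sorted(set(needed) - offered)
                problems.append(f"{src} does not send {missing} to {p.name}")
    return problems
```

Because the check runs on instantiations, a clause pair that matches textually can still be reported if some instance of it has no partner.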

A  procedure  contains  one  or  more  statements  from  the  V  language.  These  include 
references,  assignments,  reduction  operations,  other  operations,  enumeration  descriptions 
and  block  structure.  For  every  reference  in  the  procedure  of  a  process  it  must  be  true  that 
the  PROCESSORS  statement  for  that  process  has  either  a  USES  clause  or  a  HAS 
clause  for  that  reference. 


Each  reference  to  a  value  that  is  only  available  from  another  processor  is  an  abbreviation 
for a “guarded command” [Hoa78] whose guard is the availability of the datum and whose
action  is  the  retrieval  of  the  datum  to  the  point  of  invocation  of  the  reference. 

Each HEARS/TALKS, HEARS/LINKS(to), LINKS(from)/TALKS and LINKS(from)/LINKS(to)
pair denotes a pool. In addition there is a single pool in each processor for
local memory. The consistency rules require that there will be corresponding subclauses
in  each  of  these  clause  pairs.  This  means  that,  for  example,  everything  that  is  SENt  is 
USEd.  It  also  requires  that  both  ends  of  a  link  be  present.  Each  of  these  subclause  pairs 
represents  a  name.  The  variable  name  together  with  its  indices  is  used  as  the  name  in 
the  local  memory  pool.  Each  SENDS  clause  instantiation  must  match  a  USES  clause 
instantiation,  and  this  condition  can  only  be  met  if  the  clauses  themselves  match. 

There  are  relationships  between  this  model  and  the  data  flow  machine  models.  See,  for 
example,  [TAmSS],  although  the  mechanism  of  that  model  differs  from  the  mechanism  we 
use.  With  data  flow  machines,  each  operation  is  represented  by  an  object  (sometimes  a 
word  in  a  memory,  sometimes  a  physical  processor)  which  has  a  name  and  which  gives  the 
name  of  one  or  more  operands.  Its  name  can  be  used  as  an  operand  in  other  operations. 
With  our  model  the  code  fragments  for  the  processors  correspond  to  the  operators,  the 
USES  clauses  to  the  operand  names,  and  the  SENDS  clauses  to  the  exported  operands. 

There  are  two  ways  that  multiple  processes  in  one  processor  can  be  specified.  One  is 
by  declaring  multiple  procedures  in  one  processor.  The  other  way  is  with  enumeration 
statements.  These  are  of  the  form 


(in processor P) :
∀w ∈ s
    procedure
end

where  s  is  set-valued.  (The  fragment  (in  processor  P)  is  the  declaration  that  the  following 
procedure  runs  in  processor  P.)  There  is  no  commitment  to  a  specific  order.  In  this  form 
of multiprogramming, a process is created for each instantiation of the ∀ variable. The
processes  have  identical  procedures  except  for  this  instantiation. 

A  pool  contains  triples  of  the  form  (name,  index,  value).  Several  operations  are  defined 
on  pools,  and  the  accesses  to  the  pools  that  appear  in  the  procedures  must  be  drawn  from 
this  set. 

A reference has the form GET(options, pool) or GET(options, pool, name) or
GET(options, pool, name, index). options is a two-tuple of one of destructive, mark =
(mark), and nil; and either hang or test. The semantics of this is that the pool is checked
for the presence of any datum matching as much as we know about the name and index.
We return the value if there is one; if there isn’t, we either return false if we were
testing, or suspend progress of the process if we weren’t. The first part of the option
describes what we do next if the retrieval was successful. If we were destructive, we delete
the item; if we were marking, we mark the item so that a subsequent retrieval with the
same mark = (mark) will not succeed with this item; and if the first option was nil we do
nothing (and another retrieval request might pull the same item).

A store is of the form PUT(pool, name, index, value). This modifies the state of the
world so that a GET(options, pool) can equal (index, value), GET(options, pool, name) =
value, and GET(options, pool, name, index) = value.
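
The pool operations can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than the TRANSCONS implementation: the class `Pool`, the exception `WouldHang`, and the argument names are hypothetical; a hang is modelled by raising an exception instead of suspending a process; and every form of GET returns only the value, which simplifies the text.

```python
class WouldHang(Exception):
    """Raised where a real process would suspend until a datum arrives."""


class Pool:
    def __init__(self):
        # Each entry is [name, index, value, marks already applied to it].
        self._items = []

    def put(self, name, index, value):
        self._items.append([name, index, value, set()])

    def get(self, disposition=None, mode="test", name=None, index=None):
        """disposition: 'destructive', ('mark', m) or None (nil); mode: 'hang' or 'test'."""
        mark = disposition[1] if isinstance(disposition, tuple) else None
        for pos, (item_name, item_index, value, marks) in enumerate(self._items):
            if name is not None and item_name != name:
                continue
            if index is not None and item_index != index:
                continue
            if mark is not None and mark in marks:
                continue  # already retrieved once under this mark
            if disposition == "destructive":
                del self._items[pos]
            elif mark is not None:
                marks.add(mark)
            return value
        if mode == "test":
            return False
        raise WouldHang("no matching datum in the pool")
```

Returning false for a failed test follows the text, so a stored value that is itself false cannot be distinguished from a miss in this sketch.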


Grouping of the processes within a processor into superprocesses, which are collections
of processes that intercommunicate more than other pairs of processes within a processor,
is provided. They are specified by collecting the processes within a group into a supposedly
independent “processor”, and then collecting the “processors” with an AGGREGATION
declaration. The AGGREGATION declaration has all of the components of a
PROCESSORS declaration, plus a possibly enumerated list of the processors it contains.
There  is  a  consistency  rule  that  requires  that  each  AGGREGATION  HAVE,  HEAR 
etc.  everything  that  its  components  HAVE,  HEAR  etc. 

All  of  the  values  described  in  this  Section  can  be  closures  (see  Chapter  5)  as  well  as 
ordinary  values. 


A simple multiprogramming solution to our sample specification would be described as:


A istype INBOUND ARRAY({1...n})
Pa istype PROCESSORS HAS A_i, i ∈ {1...n}
    TALKS Pb_j, j ∈ {i+1...n} (SENDS A_i)

A' istype OUTBOUND ARRAY({1...n})
Pa' istype PROCESSORS HAS A'_i, i ∈ {1...n}
    HEARS Pb'_i (USES B'_i)

B istype ARRAY({1...n})
Pb istype PROCESSORS i, i ∈ {1...n} HAS B_i
    HEARS Pa (USES A_i)
    TALKS Pb'_i (SENDS B_i)

B' istype ARRAY({1...n})
Pb' istype PROCESSORS i, i ∈ {1...n} HAS B'_i
    HEARS Pb_i (USES B_i)
    if 1 < i < n then
        LINKS Pb'_{i-1}, Pb'_{i+1}
            (PASSES B_j, j ∈ {1...i-1})
    if 1 < i then
        HEARS Pb'_{i-1} (USES B_j, j ∈ {1...i-1})


(in Pb'_i) :
    temp ← 0
    ∀j ∈ {1...i}
        temp ← temp + B_j
    end ∀
    B'_i ← temp


Some  irrelevant  detail  has  been  ignored,  but  key  points  are  that  all  enumerations  are 
unordered, and that in B'_i there are i processes waiting to finish. The enumeration is an
abbreviation for text that updates temp and keeps track of whether it is complete so the
assignment B'_i ← temp can be made.
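
The bookkeeping hidden behind an unordered enumeration can be made explicit. The Python sketch below is our own illustration (the names `UnorderedSum` and `run_in_random_order` are hypothetical): contributions arrive in an arbitrary order, each one updates temp, and the final assignment happens only once every index has been accounted for. The index set is a parameter, so the sketch does not commit to particular bounds.

```python
import random


class UnorderedSum:
    """Accumulates one contribution per index, in whatever order they arrive."""

    def __init__(self, indices):
        self.pending = set(indices)
        self.temp = 0
        self.result = None  # stands for the output element, unset until complete
        if not self.pending:
            self.result = self.temp

    def arrive(self, j, value):
        if j not in self.pending:
            raise ValueError(f"index {j} is unknown or was already consumed")
        self.pending.remove(j)
        self.temp += value
        if not self.pending:  # every process has finished
            self.result = self.temp


def run_in_random_order(values_by_index, seed=0):
    """Deliver the values in a shuffled order and return the final result."""
    order = list(values_by_index)
    random.Random(seed).shuffle(order)
    acc = UnorderedSum(values_by_index)
    for j in order:
        acc.arrive(j, values_by_index[j])
    return acc.result
```

The result is independent of the delivery order because the reduction operator used here, addition, is commutative and associative.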


2.2  Single  Process  per  Processor 


This model is similar to “Multiprogramming” except for three features.


•  The  memory  pools  in  this  model  are  ordered.  There  are  potentially  multiple  pools 
per  HEARS,  etc.  clause,  as  one  must  be  provided  for  each  USES,  etc.  clause.  These 
pools are either queues or push-down lists, and the enumerations within the USES,
etc.  clauses  must  be  ordered  (enumerating  through  a  sequence  rather  than  a  set). 

•  The  SENDS  and  corresponding  USES  clauses  can  either  have  instantiations  in 
the  same  or  the  opposite  order,  making  the  communication  channel  a  pipeline  or  a 
pushdown  stack  respectively.  This  applies  to  communications  links  (where  the  source 
and  destination  are  different  processors)  and  to  memory  pools  (where  they  are  the 
same). 




•  Only  one  process  is  allowed  in  each  processor. 




This is a lower-level model than “Multiprogramming” because of the lack of the requirement
that the hardware processing elements be able to run multiple programs effectively
simultaneously. As before, more information must be supplied, consisting of modifications
to  the  program  to  explicitly  test  for  the  availability  of  required  data,  and  declarations  of 
pools  as  separate  objects. 

It  is  possible  to  transform  a  static  collection  of  simultaneously  running  programs  into  a 
single  program  with  the  same  effect,  provided  that  none  of  the  constituent  programs  enters  a 
nonterminating  loop  that  performs  no  access  to  any  pool.  One  such  set  of  transformations 
would supply a master control program which would have as coroutines copies of the
procedures of each of the multiple processes to be simulated. Each access to a pool in one
of the simulated processes must be preceded by a test to see whether there is something
there, and if there isn’t the “process” co-returns to the master control program. It is clear
that  this  preserves  correctness,  and  if  we  assume  that  the  processor  has  enough  power  to 
do  the  work  assigned  to  it,  we  will  not  see  a  situation  in  which  some  work  does  not  get 
done  because  some  process’s  program  runs  indefinitely,  always  finding  work  to  do. 
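
The folding of several processes into one program can be sketched with Python generators standing in for coroutines. This is an illustration under our own assumptions, not the transformation rules themselves: `summing_process`, `master`, and the `feed` callback that delivers data once per round are hypothetical names.

```python
from collections import deque


def summing_process(inbox, count, results, key):
    """Simulated process: adds `count` values from `inbox`, testing before each access."""
    total = 0
    for _ in range(count):
        while not inbox:  # test the pool before the access
            yield         # nothing there: co-return to the master
        total += inbox.popleft()
    results[key] = total


def master(processes, feed):
    """Resume each live coroutine in turn; `feed(round)` delivers new data first."""
    live = list(processes)
    rounds = 0
    while live:
        feed(rounds)
        still_live = []
        for proc in live:
            try:
                next(proc)
                still_live.append(proc)
            except StopIteration:
                pass  # this simulated process has finished
        live = still_live
        rounds += 1
    return rounds
```

The sketch terminates only if the data each process waits for is eventually delivered, which mirrors the assumption in the text that the processor has enough power to do the work assigned to it.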

This  transformation  can  only  be  performed  if  a  constant  number  of  processes  are  to  be 
folded.  Code  that  satisfies  the  previous  model  may  have  a  process  per  pool  datum.  For 
this  reason  we  must  impose  the  restriction  that  unordered  enumerations  are  not  permitted. 
The  pools  must  be  turned  either  into  queues  or  pushdown  stacks. 

A  pool  is  an  object  whose  type  is  pool.  It  has  a  stack?  property  and  it  can  have  (or 
lack)  a  name.  A  pool  is  associated  with  every  USES,  etc.  clause.  Every  pool  is  shared 
by  one  SENDS  or  PASSES  clause  as  a  source  and  one  USES  or  PASSES  clause  as  a 
sink*. The pool has source and sink properties, and the USES, etc. clauses have pool
properties. 

*It  is  possible  for  different  instantiations  of  a  single  PASSES  clause  to  be  both  the  source  and  the  sink. 
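
The difference between the two kinds of ordered pool comes down to the order in which data leaves. A minimal Python sketch, with the hypothetical names `OrderedPool` and `transfer` and a boolean `stack` flag standing in for the stack? property:

```python
from collections import deque


class OrderedPool:
    """An ordered pool: a queue (pipeline) or a pushdown stack."""

    def __init__(self, stack):
        self.stack = stack  # plays the role of the pool's stack? property
        self._data = deque()

    def put(self, value):
        self._data.append(value)

    def get(self):
        # A stack returns the newest datum, a queue the oldest.
        return self._data.pop() if self.stack else self._data.popleft()


def transfer(values, stack):
    """Send all values through a pool and report the order of arrival."""
    pool = OrderedPool(stack)
    for v in values:
        pool.put(v)
    return [pool.get() for _ in values]
```

With `stack=False` the sink sees the instantiations in the same order as the source; with `stack=True` it sees them in the opposite order.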


Only a single process per processor can exist. The procedure may contain enumerations,
but they must be ordered and they denote sequential composition and not parallel
composition  of  tasks. 


On  this  level  it  is  important  how  many  uses  are  made  of  a  given  datum,  because  the 
logical  connection  between  a  datum  and  its  use  is  made  by  counting.  For  this  reason  some 
streams  are  duplicated,  i.e.,  information  is  put  in  at  one  stream  and  removed  at  several. 
In  this  case,  language  like 


(USES a ...)
(SENDS b ...)
(SENDS c ...)

(in P_i) :
    b, c ← a


will be used to describe the copying of the stream that supplies a into ones that supply
b and c.


Part of a parallel structure that satisfies the benchmark specification on this level
follows:


A istype INBOUND ARRAY([1...n])
Pa istype PROCESSORS HAS A_i, i ∈ [1...n]
    TALKS Pb_j, j ∈ [i+1...n] (SENDS A_i)

B' istype ARRAY([1...n])
Pb' istype PROCESSORS i, i ∈ [1...n] HAS B'_i
    HEARS Pb_i (USES B_i)
    if 1 < i < n then
        LINKS Pb'_{i-1}, Pb'_{i+1}
            (PASSES B_j, j ∈ [i-1...1])
    if i < n then
        LINKS Pb_i, Pb'_{i+1}
            (PASSES B_i)
    if 1 < i then
        HEARS Pb'_{i-1} (USES B_j, j ∈ [i-1...1])

(in Pb'_i) :
    temp ← 0
    for j ∈ [i-1...1]
        temp ← temp + B_j
    end for
    B'_i ← temp

The difference between this solution and the previous one is that all orderings are
explicit. Note that the j enumerations are “backward”. The for operator replaces the
∀ operator of the previous parallel structure description. Instead of specifying unordered
and therefore potentially concurrent executions in one processor, it specifies “ordinary”
looping. 

2.3  Clocked 


In  the  clocked  level  there  is  an  object  called  a  clock.  It  is  permitted  to  take  values  from 
an  ordered  domain  (conceptually,  vectors  of  integers;  usually  a  scalar  of  type  integer).  The 
print  prototypes  (syntax  declarations)  of  TRANSCONS  allow  every  USES,  etc.  clause  to 
have  an  AT  clause.  Consistency  rules  require  that  effect  follow  cause.  For  example,  no 
SENDS clause instantiation occurs AT any time before the computation it depends on
USES all of its values, plus an amount that depends on the nature of the computation as
described  below. 

The  additional  specializing  information  beyond  “Single  Process”  is  the  AT  information. 
The  value  of  the  AT  property  of  a  USES,  etc.  clause  is  a  parametric  expression  in  variables 
bound  in  the  scope  of  the  clause.  This  allows  the  hardware  of  a  processor  to  be  simplified 
in  several  ways,  depending  on  some  of  the  values  of  the  AT  clauses. 

The  programs  can  be  rewritten  to  not  test  whether  data  is  available.  Data  is  assumed  to 
be  available  at  the  appropriate  times,  and  the  hardware  can  merely  “gate  in”  data,  or  read 
the  port  without  regard  to  signals  describing  the  presence  or  absence  of  data.  Similarly, 
data  can  be  written  without  regard  as  to  whether  there  is  room  for  it. 

All  information  present  in  “Single  Process”  is  present  for  this  level  also.  A  few  additional 
elements are added. There is a mapping T : s → i where s is an element of V syntax (a
node type) and i is an integer. T(s) will be called the intrinsic delay of type s. The
interpretation of this is that if s is an operator (e.g., +) then T(s) is the time required
to perform the operation, and if s is atomic then T(s) is the time required to develop the
value. The time to evaluate a + b, for example, is 2T(⟨variable reference⟩) + T(+).

T(m)  where  m  is  a  reduction  operation  is  the  time  for  a  single  step.  This  means  that 
a  single  value  or  set  of  values  will  necessarily  be  absorbed  after  t  time  units. 

A “Single Process” specification can be converted into a “Clocked” specification by
addition  of  a  clock  declaration,  and  assigning  times  to  each  of  the  USES  clauses.  The 
transformation  rules  assign  a  time  of  0  to  the  first  instantiation  of  each  SENDS  clause, 
and  use  propagation  techniques  to  assign  times  to  other  events.  The  AT  expression  of  an 
instantiation  of  an  output  of  a  node  must  at  least  equal  the  highest  of  the  sums  of  the  AT 
values  of  the  node’s  inputs  and  its  intrinsic  delay. 
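
The propagation step can be sketched as an earliest-time computation over the dependency graph. The Python below is our own illustration, not the TRANSCONS rules: `earliest_times` and its dictionary arguments are hypothetical, nodes without inputs are treated as first SENDS instantiations at time 0, and each remaining node receives the smallest AT value the constraint allows.

```python
def earliest_times(inputs, delay, start=0):
    """Assign each node the earliest admissible AT value.

    inputs: node -> list of the nodes whose outputs it consumes
    delay:  node -> intrinsic delay of the node's type
    """
    times = {}

    def visit(node, active=()):
        if node in times:
            return times[node]
        if node in active:
            raise ValueError("cyclic dependency: effect would precede cause")
        feeds = inputs.get(node, [])
        if not feeds:
            times[node] = start  # a source: time 0 by convention
        else:
            latest_input = max(visit(m, active + (node,)) for m in feeds)
            times[node] = latest_input + delay[node]
        return times[node]

    for node in inputs:
        visit(node)
    return times
```

Any assignment that is pointwise later than the one returned here also satisfies the "at least equal" constraint, as long as it is checked node by node.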

The  implementation  of  our  specification  to  this  level  is: 


(Here we focus only on the actions of the Pb'_i processors, which are the ones that do
the actual computation.)


C istype CLOCK [1...3n]


B' istype ARRAY([1...n])

Pb' istype PROCESSORS i, i ∈ [1...n] HAS B'_i
    HEARS Pb_i (USES B_i AT C = 0)
    if 1 < i < n then
        LINKS Pb'_{i-1}, Pb'_{i+1}
            (PASSES B_j, j ∈ [i-1...1] AT C = 3(i-j) + 2)
    if i < n then
        LINKS Pb_i, Pb'_{i+1}
            (PASSES B_i AT C = 2)
    if 1 < i then
        HEARS Pb'_{i-1} (USES B_j, j ∈ [i-1...1] AT C = 3(i-j))


(in Pb'_i) :
    temp ← 0 AT C = 1
    for j ∈ [i-1...1]
        temp ← temp + B_j AT 3(i-j) + 1
    end for
    B'_i ← temp


There is an invisible notation on the nodes whose printed representations are “temp ←
temp + B_j” and “temp ← 0” giving them AT properties. We have depicted the temp ←
temp + B_j and temp ← 0 lines as having visible AT clauses for clarity.


2.4  Fixed  Delay  Level 


A  possible  endpoint  of  a  synthesis  is  a  structure  amenable  to  VLSI  implementation. 
In  order  for  this  to  be  feasible,  several  conditions  must  be  met  that  aren’t  necessary  for  a 
parallel structure in which many von Neumann computers cooperate. An example of this
is  the  fact  that  the  memory  buffering  a  link  between  two  computing  elements  must  have  a 
definite  length,  specified  when  the  circuit  is  burnt. 




The  restrictions  we  must  impose,  that  things  happen  at  definite  times  and  are  separated 
by  definite  intervals,  can  best  be  modeled  by  providing  declarations  that  the  times  things 
happen  (AT  clauses)  are  relative  to  the  times  other  things  happen.  If  the  time  difference  is 
equal  to  an  explicit  constant,  a  VLSI  synthesis  system  could  use  a  shift  register  to  control 
the  timing  and  to  model  the  data  path.  If  the  time  difference  is  a  superglobal  this  is  still 
possible,  although  the  circuit  cannot  actually  be  sized  until  the  value  of  the  superglobal  is 
known. 

At  this  level,  the  internal  nodes  of  the  processors’  computation  nodes  as  well  as  the 
USES,  etc.  clauses,  except  for  the  SENDS  clauses  of  the  input  processors’  TALKS 
clauses,  have  AT  properties.  The  values  of  these  properties  are  sets  of  pairs  of  other  nodes 
and  strictly  positive  integer  constants  instead  of  an  expression.  For  each  node  n  there  is  one 
AT  property  for  each  node  m  from  which  n  receives  data  flow.  The  property  will  be  the 
pair of m and a positive integer t, and the semantics of this is that n finishes its computation
and makes available its output values t units of time (clock cycles) after m does. If there
is a list $(m, m_1, m_2, \ldots, m_k, n)$ such that $(m_k, t_k) \in (\mathrm{AT}\ n)$, $(m_j, t_j) \in (\mathrm{AT}\ m_{j+1})$, ...,
$(m, t_0) \in (\mathrm{AT}\ m_1)$ then there is said to be an AT-path from m to n with delay $\sum_{0 \le j \le k} t_j$.
The graph whose nodes are the nodes of the specification and whose edges are the AT-links
is  a  DAG,  but  it  is  not  necessarily  a  tree.  Two  paths  from  m  to  n  must  have  the  same 
delay.  The  delay  of  each  node  must  be  greater  than  or  equal  to  the  intrinsic  delay  of  the 
node’s  type. 
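
The path conditions can be checked mechanically. The following Python sketch is ours (the names `ancestor_delays` and `consistent` and the dictionary encoding of AT properties are hypothetical): it collects, for every node, the delays of all AT-paths reaching it, rejects cycles and non-positive delays, and reports whether any two paths between the same pair of nodes disagree.

```python
def ancestor_delays(at, n, memo, active=frozenset()):
    """Map each ancestor m of n to the set of delays of AT-paths from m to n.

    at: node -> iterable of (predecessor, delay) pairs, i.e. its AT properties
    """
    if n in memo:
        return memo[n]
    if n in active:
        raise ValueError("AT-links contain a cycle")
    result = {}
    for m, t in at.get(n, ()):
        if t <= 0:
            raise ValueError("delays must be strictly positive")
        result.setdefault(m, set()).add(t)
        for a, delays in ancestor_delays(at, m, memo, active | {n}).items():
            result.setdefault(a, set()).update(d + t for d in delays)
    memo[n] = result
    return result


def consistent(at):
    """True if every pair of nodes is joined by paths of a single delay."""
    memo = {}
    return all(
        len(delays) == 1
        for n in at
        for delays in ancestor_delays(at, n, memo).values()
    )
```

The check that each delay is at least the intrinsic delay of the node's type is left out of this sketch, since it needs the type information as well.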

The  motivation  for  this  level  of  description  is  that  a  specification  that  meets  these 
conditions  can  be  simply  transformed  into  suitable  input  for  a  VLSI  placement  program 
by  taking  several  steps,  assuming  that  all  nodes  of  the  specification  have  types  whose 
operator  can  be  implemented  as  a  single  object  in  VLSI.  If  this  is  not  the  case  (for  example 
if  the  specification  includes  a  multiplication  node  and  the  library  has  only  addition),  the 
offending node must first be broken down into simpler nodes. The intrinsic delay of each
node type should equal the number of clock cycles required for the circuit that implements
the function to work. If (m, t) ∈ (AT n) and the intrinsic delay of n is j = t − k, then
fabricate a wire from the circuit implementing m to a shift register with k elements whose
output is connected to the appropriate input port of n.
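
The sizing rule amounts to a subtraction: the shift register absorbs whatever part of the AT delay the node's own circuit does not account for. A minimal Python sketch under that reading, with the hypothetical names `shift_register_length` and `ShiftRegister`:

```python
def shift_register_length(t, intrinsic_delay):
    """Number of cells k such that intrinsic_delay + k equals the AT delay t."""
    k = t - intrinsic_delay
    if k < 0:
        raise ValueError("AT delay is shorter than the node's intrinsic delay")
    return k


class ShiftRegister:
    """A k-cell delay line: a value clocked in emerges k cycles later."""

    def __init__(self, k, fill=None):
        self.cells = [fill] * k

    def clock(self, value):
        """Advance one cycle: push `value` in and return what falls out."""
        if not self.cells:
            return value  # zero cells behave like a plain wire
        self.cells.append(value)
        return self.cells.pop(0)
```

When the delay is a superglobal rather than a constant, k cannot be computed until that value is known, which matches the remark that such a circuit cannot yet be sized.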


It  is  more  difficult  to  display  the  implementation  of  our  specification  on  this  level, 
because  the  AT  clauses  do  not  equate  C  with  a  form  whose  free  variables  are  indices  but 
instead with an object with two slots: another node and an integer constant. The structure
that results is circular. We will display this by giving names of the form ⟨script alphabetic⟩:
to some nodes so they can be referred to by AT clauses. Again we will depict the temp ←
temp + B_j and temp ← 0 lines as having visible AT clauses.


C istype CLOCK [1...3n]

...

B' istype ARRAY([1...n])

Pb' istype PROCESSORS i, i ∈ [1...n] HAS B'_i
    HEARS Pb_i (𝒜 : USES B_i AT C = 0)
    if 1 < i < n then
        LINKS Pb'_{i-1}, Pb'_{i+1}
            (ℬ_j : PASSES B_j, j ∈ [i-1...1] AT ℱ_j + 1)
    if i < n then
        LINKS Pb_i, Pb'_{i+1}
            (𝒞 : PASSES B_i AT ℰ + 1)
    if 1 < i then
        HEARS Pb'_{i-1} (𝒟_j : USES B_j, j ∈ [i-1...1] AT ℬ_j + 1)


(in Pb'_i) :
    ℰ : temp ← 0 AT 𝒜 + 1
    for j ∈ [i-1...1]
        ℱ_j : temp ← temp + B_j AT 𝒟_j + 1
    end for
    B'_i ← temp


2.5  Summary 


We  have  seen  that  there  are  four  useful  description  levels  for  parallel  structures. 

The  highest  level,  multiprogramming,  is  useful  when  a  processor  that  can  efficiently 
perform context switching and can contain memory pools would not be objectionable.
This  would  be  the  case  when  a  typical  microprocessor  with  sufficient  memory  could  be 
provided  for  each  processor. 

The  second  level,  single  process,  is  useful  when  a  general  purpose  processor,  but  no 
context  switching  or  memory  pool  management  mechanism,  can  be  provided.  An  example 
of  such  a  situation  would  be  the  use  of  a  typical  microprocessor  with  no  memory  in  addition 
to  its  internal  memory.  This  internal  memory  is  normally  insufficient  to  hold  several 
inactive  contexts. 

The  third  level,  clocked,  covers  situations  in  which  a  processor  simple  enough  to  do 
things  in  a  specific,  fixed  manner  and  order  is  desirable,  but  enough  logic  and  memory  can 
be  provided  to  allow  for  the  storage  and  retrieval  of  some  intermediate  values.  An  example 
of  a  technology  that  would  be  appropriate  is  interconnections  of  finite  state  machines  and 
FIFO  and  LIFO  devices. 

The lowest level, fixed delay, must be reached when the computation elements are
restricted to combinatorial logic and latches. Only fixed time differences between the
occurrences of various events are allowed.

We  therefore  have  a  series  of  synthesis  levels  and  corresponding  computation  models 
for  the  four  main  technologies  in  which  one  would  want  to  implement  a  parallel  structure. 
Chapter 3


Case Studies of TRANSCONS Techniques


To  develop  the  techniques  described  above,  we  have  explored  efficient  parallel  structures 
for  several  classical  problems  and  algorithms,  described  in  the  following  Sections.  In  all 
cases  there  will  be  a  series  of  specifications,  separated  by  rules  and  prose,  describing  a 
series  of  states  of  a  node  currently  being  transformed  by  TRANSCONS.  We  will  highlight 
the  changes  with  a  vertical  stroke  (|)  in  the  left  margin,  but  will  supply  the  entire  current 
state  of  the  node  for  reference.  A  node  in  V  roughly  corresponds  to  a  syntactic  object  in 
a  syntax  tree. 

3.1 Polynomial-Time Dynamic Programming

We have examined a class of polynomial time (P-time) dynamic programming algorithms
for which it is possible to synthesize an optimal parallel scheme. The synthesis
uses rules displayed below, and inference capabilities described in [Bro82]. Abstractly
programmed algorithms in this class include the Cocke-Younger-Kasami parsing algorithm
for a fixed, possibly ambiguous Chomsky Normal Form grammar, described in [A'U172];
the Optimal Binary Search Tree algorithm, described in [Knu73]; and Optimal Multiple
Matrix Multiplication, described in [AHU74]. All of the algorithms fit into the following
scheme.


Each algorithm generates the “solution” to a problem whose input is a sequence S of
n items by using a dynamic programming technique. This technique generates a solution
for a sequence of items by combining solutions for contiguous subsequences. The solution
V(K) for a sequence K of length n is found by:


1. Generating the n − 1 possible partitions of K into contiguous nonempty subsequences
I and J such that I‖J = K;

2. Forming for each partition a partial solution for I‖J by applying a function F to
V(I) and V(J);

3. Obtaining V(I‖J) by combining (using a binary operation ⊕) all of the partial
solutions. This is expressed formally below:


\[ V(K) = \bigoplus_{I,J \,:\, I \| J = K} F(V(I), V(J)) \]

In order to obtain the following parallel structure and have it run in time Θ(n), two
conditions must hold:


• Both ⊕(x, y) and F(x, y) must take constant time,

• ⊕ must be associative. This allows F(V(I), V(J)) values to be included in the
running ⊕-total in any order they become available.
These conditions are met by a sizable class of problems, e.g., the problems mentioned
above. The dynamic programming scheme described above generates the solution V(S) for
the original problem S of length n. The process starts with V((s_i)) for each s_i ∈ S, then
generates solutions for subsequences of length 2, 3, and so on, up to n. We give below two
dynamic programming algorithms that fit into this scheme.

The Cocke-Younger-Kasami algorithm parses a sequence of terminal symbols according
to a fixed context free grammar in Chomsky Normal Form. This form specifies that each
production rule in the grammar is either of the form N → t for some nonterminal N
and terminal t, or N → PQ for nonterminals N, P, and Q. In this parsing algorithm,
each problem is a sequence of terminal symbols, T, and the solution V(T) is the set of
nonterminal symbols that derive T. Let the initial terminal sequence be (t_1 ... t_n). Then
V((t_i)) are those nonterminals N for which there is a production rule in the grammar of
the form N → t_i. Given two contiguous sequences of terminals A and B, the nonterminals
that produce A‖B include those nonterminals N for which there is a rule N → PQ where
P ∈ V(A) and Q ∈ V(B). The nonterminals that produce a sequence S are obtained by
dividing the sequence S into two subsequences in all possible ways and taking the union
of the results. In our formalism,


\[ F(V(S), V(T)) = \{\, N \mid [N \to PQ] \in G \wedge P \in V(S) \wedge Q \in V(T) \,\} \]


and

⊕ is the union operation, which is indeed associative.
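
The CYK instance of the scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy grammar, the table names UNARY and BINARY, and the helper names base, F, combine and V are hypothetical and do not come from the thesis; the recursion follows the three-step scheme directly rather than any tabular or parallel form.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky Normal Form (an assumption, not from the source):
#   S -> A B | B A,   A -> 'a',   B -> 'b'
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("B", "A"): {"S"}}


def base(t):
    """V((t)): the nonterminals N for which the grammar has a rule N -> t."""
    return frozenset(UNARY.get(t, ()))


def F(vs, vt):
    """F(V(S), V(T)) = {N | [N -> P Q] in G and P in V(S) and Q in V(T)}."""
    out = set()
    for p, q in product(vs, vt):
        out |= BINARY.get((p, q), set())
    return frozenset(out)


def combine(x, y):
    """The combining operation for CYK is set union."""
    return x | y


def V(seq):
    """Solution for a nonempty tuple of terminals, following the three-step scheme."""
    if len(seq) == 1:
        return base(seq[0])
    total = frozenset()
    for cut in range(1, len(seq)):  # the len(seq) - 1 contiguous partitions
        total = combine(total, F(V(seq[:cut]), V(seq[cut:])))
    return total
```

Calling V(("a", "b")) evaluates the single partition of a two-symbol sequence; longer inputs recompute subproblems, which the array A introduced later in this section avoids.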

Another example of a dynamic programming algorithm fitting our scheme is finding
the complexity of the optimal grouping to multiply a given sequence (M_1, M_2, ..., M_n) of
matrices. Since matrix multiplication is associative, multiplying the matrices in different
groupings produces the same result matrix, but different groupings may have different
execution efficiencies. If M is a p × q matrix, and N is a q × r matrix, then the product
M × N will be a p × r matrix, and the multiplication will execute in time proportional to
pqr (if a simple matrix multiplication algorithm is used).


This problem fits into the scheme presented above in the following fashion. The
“solution” for each matrix subsequence V((M_i, ..., M_j)) is a triple (p, q, c): p is the row size
of M_i; q the column size of M_j (since multiplication using any grouping of (M_i, ..., M_j)
results in a p × q matrix) and c is the optimal execution cost for computing M_i × ··· × M_j.
The F for this algorithm is defined below:


\[ F((p_1, q_1, c_1), (p_2, q_2, c_2)) = (p_1, q_2, c_1 + c_2 + p_1 q_1 q_2) \]


⊕ for this algorithm returns the triple with the minimum cost element. (Since only the
costs can differ among triples, ⊕'s choice is arbitrary if the costs happen to be the same.)
The minimum operation is associative and commutative.
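
As a concrete illustration, the matrix-grouping instance can be run sequentially with a short Python sketch. The function names F, combine and solve, and the dictionary A keyed by (start, length), are assumptions made for this example; the table layout anticipates the array of subsequence solutions described next, and nothing here is the synthesized parallel structure.

```python
from functools import reduce


def F(left, right):
    """Combine the (rows, cols, cost) triples of two adjacent subchains."""
    p1, q1, c1 = left
    p2, q2, c2 = right
    assert q1 == p2, "adjacent matrices must be conformable"
    return (p1, q2, c1 + c2 + p1 * q1 * q2)


def combine(x, y):
    """Return the triple with the smaller cost (ties broken arbitrarily)."""
    return x if x[2] <= y[2] else y


def solve(dims):
    """dims is a list of (rows, cols) pairs; returns the triple for the whole chain.

    A[(l, m)] holds the solution for the subsequence of length m that starts
    at position l (1-based).
    """
    n = len(dims)
    A = {(l, 1): (p, q, 0) for l, (p, q) in enumerate(dims, start=1)}
    for m in range(2, n + 1):
        for l in range(1, n - m + 2):
            partials = (F(A[(l, k)], A[(l + k, m - k)]) for k in range(1, m))
            A[(l, m)] = reduce(combine, partials)
    return A[(1, n)]
```

For example, solve([(10, 30), (30, 5), (5, 60)]) compares the two groupings of a three-matrix chain and keeps the cheaper one.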


A high-level specification of the dynamic programming algorithm is presented below. A
subsequence can be represented by its length and where it begins. The array A used below
contains solutions to subsequences: the element A_{l,m} contains V((I_l, ..., I_{l+m−1})), where I
is the initial sequence. The complexity of each “executable” statement is presented at the
right.


The  algorithm  specification  is  as  follows: 


A istype ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
v istype INPUT ARRAY (l), 1 ≤ l ≤ n

∀ l ∈ (1...n)                                                   Θ(1)
    A_{l,1} = v_l                                               Θ(n)
∀ m ∈ (2...n)                                                   Θ(1)
    ∀ l ∈ {1...n − m + 1}                                       Θ(n)
        A_{l,m} = ⊕_{k ∈ {1...m−1}} F(A_{l,k}, A_{l+k,m−k})     Θ(n³)


Figure 3.1: Specification of Θ(n³) Dynamic Programming

F and ⊕ because it is given that a single evaluation of both F and ⊕ takes constant
time.

The time complexity of the specified algorithm is indeed Θ(n³) when executed on a
sequential machine. A trick is available for one of the problems, Optimal Binary Search Tree
of [Knu73]. This trick involves bounding k in Figure 3.1 more narrowly than {1 ... m − 1}.
This trick reduces the algorithm's running time to Θ(n²), but it does not generalize to the
other algorithms.

It is possible to implement the specification on a two-dimensional array of Θ(n²)
processors and the resulting structure will solve n-element problem instances in Θ(n) time.
We know of no analog to the trick mentioned above for parallel structures. The memory
size of each processor is Θ(n). Below we describe the operation of the structure, and then
prove that it is a Θ(n) algorithm. This parallel structure has been reported in the literature
[GKT79].

The network of processors is displayed in Figure 3.2. Observe that P_{l,m} is connected
to P_{l,m−1} and P_{l+1,m−1}. Each processor P_{l,m} will compute the value of A_{l,m}. To do this it
needs two streams of information: A_{l,k} and A_{l+k,m−k}, where k < m. These streams of data
come respectively over wires from processors P_{l,m−1} and P_{l+1,m−1}. Each processor P_{l,m}
(except P_{1,n}) will send every A-value received from P_{l,m−1} to P_{l,m+1} and from P_{l+1,m−1}
to P_{l−1,m+1} as soon as P_{l,m} gets it. Each processor will also compute F-values and merge
them into a running ⊕-total as soon as it gets the necessary A-values.

P_{1,1}   P_{2,1}   P_{3,1}   P_{4,1}

P_{1,2}   P_{2,2}   P_{3,2}

P_{1,3}   P_{2,3}

P_{1,4}

Figure 3.2: Processor Interconnections

At first glance, it might appear that this algorithm has time complexity Θ(n²). Each
processor needs to receive Θ(n) A-values from each of its incoming wires; it must at some
time perform Θ(n) worth of computation on the data received before it sends its result on
each of its outgoing wires. However, a careful timing argument shows that an execution
time of Θ(n) can be achieved.

Definition 3.1 Within P_{l,m}, for any k where 1 ≤ k < m, A_{l,k} and A_{l+k,m−k} are called a
complementary pair of A-values.

Processor P_{l,m} will apply F to each complementary pair of A-values.

The next lemma shows that each processor P_{l,m} receives all 2m − 2 values it needs,
though it waits Θ(m) for its first complementary pair, A_{l,⌈m/2⌉} and A_{l+⌈m/2⌉,m−⌈m/2⌉}.


Lemma 3.1 Each processor P_{l,m} where 1 ≤ l ≤ n − m + 1 receives the values A_{l,m'} where
1 ≤ m' < m and (separately) A_{l+m−m',m'} where 1 ≤ m' < m, in order of increasing m'.

Proof: By induction on m. Clearly this is true for P_{l,2}, which receives only one value
on each of its incoming wires. Now suppose it is true for P_{l,m−1} and P_{l+1,m−1}. Then P_{l,m}
will receive A-values in the proper order from P_{l,m−1} and P_{l+1,m−1} through m' = m − 2,
following which it receives A_{l,m−1} and A_{l+1,m−1} from those processors. But the latter two
A-elements are just those required to preserve the sequences. |

Let T be a time-dependent variable such that at system startup T = 0, and after x
units of time T = x. The time unit satisfies the first condition of the following lemma.

Lemma 3.2 If all of the following conditions are met:

• All of the following takes processor P_{l,m} no more than one unit of time: receiving
two values, one each from P_{l,m−1} and P_{l+1,m−1}; sending these values on to P_{l,m+1}
and P_{l−1,m+1}; applying function F twice to two complementary pairs of A-values
if all values are available; and merging the resulting value into a running ⊕-total.

• The A-values come into P_{l,m} in the order indicated by Lemma 3.1.

• Each processor P_{l,m} sends values received from P_{l,m−1} resp. P_{l+1,m−1} to P_{l,m+1} resp.
P_{l−1,m+1} no later than one time unit after receipt.

• At T = 0 processor P_{l,1} transmits A_{l,1}.


then P_{l,m} will compute A_{l,m} no later than T = 2m.

Proof: By induction: P_{l,1} is initialized to know A_{l,1}. Now suppose the lemma is true
for m < i and we wish to show it for m = i. We first show the following claim: that at
T = m + j, P_{l,m} will have included at least max(0, 2(j − ⌈m/2⌉)) F-values in its running
⊕-total. This claim is proven by induction on j. When reading the proof of the claim,
keep in mind that the “life” of a processor P_{l,m} is divided into three epochs:

1.  When  T  <m,  the  processor  may  have  received  no  A-values. 

2. When m ≤ T < (3/2)m, the processor will have received at least T − m A-values from
each of its input lines. Since the first half of the A-values from each inbound wire
form complementary pairs with the last half of the values from the other inbound
wire, P_{l,m} may not have been able to perform any calculations of any F-values yet.

3. When T ≥ (3/2)m, the processor will have received at least half (more accurately, at least
T − m) of the values from each inbound wire. During each unit interval, it will receive
one  A-value  from  each  inbound  wire,  which  will  form  a  complementary  pair  with  some 
value  that  was  stored  from  the  other  wire  during  epoch  2.  Two  F-calculations  will 
be  possible  -  one  pairing  each  of  the  just-received  inbound  data  with  a  previously 
received  input  datum  from  the  other  side  (unless  m  is  odd  and  T  =  m  +  2^,  in 
which  case  the  two  values  arriving  at  this  time  form  a  complementary  pair). 

If  j  =  0  the  claim  requires  nothing.  If  j  >  0,  consider  the  situation  at  T  =  2(i  -  6).  All 
processors  Pj_fc  and  Pi+k,k>  where  k  <i  —  b  will  have  completed  their  work.  Their  answers 
will  have  had  time  to  reach  Pj,i  after  b  time  units,  or  at  time  T  =  2i  -  6.  But  j  =  i-  b,  so 
by  T  =  i  +  j  at  least  2j  A-values  will  have  arrived  from  each  input  connection,  and  since 
there  only  i  complementary  pairs  if  j  >  |  only  2(i  -  j)  pairs  can  be  incomplete,  meaning 
that  at  least  2{j  -  [j])  pairs  are  complete.  Since,  by  induction  on  j,  two  time  units  ago 




2(j − ⌈i/2⌉) − 2 F-values had already been merged into the running ⊕-total, there is plenty of
time to merge two new F-values into the running ⊕-total, completing the induction step
of the claim.

Lemma 3.2 follows immediately from the claim and the observation that the merging
of m − 1 F-values into the running ⊕-total in P_{l,m} constitutes a calculation of A_{l,m}.


Theorem 3.3 The time to compute A_{1,n} is Θ(n).


Proof: Immediate from Lemma 3.2 |
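
The timing argument can be checked empirically with a small tick-by-tick simulation, sketched below in Python. It is a simplified model under stated assumptions, not the structure TRANSCONS produces: every wire carries at most one value per tick, each processor forwards what it receives, applies F at most twice per tick, and the names simulate, left_out, right_out, got_l, got_r and pending are hypothetical. The matrix-grouping F and minimum are used because that minimum is commutative as well as associative, so the merge order used here is harmless.

```python
from collections import deque


def F(left, right):
    """Matrix-grouping F on (rows, cols, cost) triples."""
    p1, q1, c1 = left
    _, q2, c2 = right
    return (p1, q2, c1 + c2 + p1 * q1 * q2)


def combine(x, y):
    """Keep the cheaper triple; associative and commutative."""
    return x if x[2] <= y[2] else y


def simulate(values):
    """Run the triangular array on v_1..v_n; return (A, finish).

    finish[(l, m)] is the tick T at which P_{l,m} completed A_{l,m}.
    Streamed items are (m', value) pairs, tagged by subsequence length m'.
    """
    n = len(values)
    procs = [(l, m) for m in range(1, n + 1) for l in range(1, n - m + 2)]
    left_out = {p: deque() for p in procs}    # feeds P_{l,m+1}
    right_out = {p: deque() for p in procs}   # feeds P_{l-1,m+1}
    got_l = {p: {} for p in procs}            # tag k  -> A_{l,k}
    got_r = {p: {} for p in procs}            # tag m' -> A_{l+m-m',m'}
    pending = {p: deque() for p in procs}     # complete pairs awaiting F
    merged = {p: 0 for p in procs}
    total, A, finish = {}, {}, {}

    for l, v in enumerate(values, start=1):   # P_{l,1} transmits A_{l,1} at T = 0
        A[(l, 1)], finish[(l, 1)] = v, 0
        left_out[(l, 1)].append((1, v))
        right_out[(l, 1)].append((1, v))

    T = 0
    while (1, n) not in A and T < 4 * n:      # 4n is only a safety bound
        T += 1
        # Phase 1: each wire delivers at most one value queued before this tick.
        arrivals = []
        for (l, m) in procs:
            if m >= 2:
                lq, rq = left_out[(l, m - 1)], right_out[(l + 1, m - 1)]
                arrivals.append(((l, m),
                                 lq.popleft() if lq else None,
                                 rq.popleft() if rq else None))
        # Phase 2: store, forward, apply F at most twice, merge.
        for p, lv, rv in arrivals:
            m = p[1]
            if lv is not None:
                got_l[p][lv[0]] = lv[1]
                left_out[p].append(lv)
                if m - lv[0] in got_r[p]:
                    pending[p].append(lv[0])
            if rv is not None:
                got_r[p][rv[0]] = rv[1]
                right_out[p].append(rv)
                if m - rv[0] in got_l[p]:
                    pending[p].append(m - rv[0])
            for _ in range(min(2, len(pending[p]))):
                k = pending[p].popleft()
                f = F(got_l[p][k], got_r[p][m - k])
                total[p] = combine(total[p], f) if p in total else f
                merged[p] += 1
            if merged[p] == m - 1 and p not in A:
                A[p], finish[p] = total[p], T
                left_out[p].append((m, A[p]))
                right_out[p].append((m, A[p]))
    return A, finish
```

Running simulate on a list of (rows, cols, 0) triples and comparing each finish[(l, m)] with 2m gives a quick sanity check of the bound in Lemma 3.2 for that input; it is of course not a substitute for the proof.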


A  similar  but  more  general  result  will  be  shown  in  Appendix  Section  B.  We  will  show 
how  this  parallel  structure  can  be  derived  from  the  specification  in  Figure  3.1. 


3.1.1  Preparatory  Rules 


The problems amenable to TRANSCONS synthesis have internal arrays of storage, and
the requirement must be to fill in an array by computing a value for each element. Our
strategy will be to assign a processor to each element of the array. The first preparatory
rules, MAKE-PSs and MAKE-IOPSs, declare a processor family for each array of
the problem and compose a single enumerated PROCESSORS declaration. This declaration
has several clauses: the processors definition clause, the HAS clause, the HEARS
clause(s), and the USES clause(s). PROCESSORS declarations were described in
Section 2.1, but we will give a more complete example below. Any part of the PROCESSORS
declaration except the processors definition clause can be made conditional.


P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
    HAS A_{l,m}
    if m = 1 then HEARS Q (USES v_l)
    if 2 ≤ m ≤ n then
        HEARS P_{l,m−1} (USES A_{l,k}, 1 ≤ k ≤ m − 1)
        HEARS P_{l+1,m−1} (USES A_{l+k,m−k}, 1 ≤ k ≤ m − 1)
    if 1 ≤ m ≤ n − 1 then
        TALKS P_{l,m+1} (SENDS A_{l,k}, 1 ≤ k ≤ m + 1)
    if 1 ≤ m ≤ n − 1 ∧ l ≥ 2 then
        TALKS P_{l−1,m+1} (SENDS A_{l−k,m+k}, 1 ≤ k ≤ m + 1)
    if 1 ≤ m ≤ n − 1 ∧ 1 ≤ m ≤ n − l then
        LINKS P_{l,m−1}, P_{l,m+1}
            (PASSES A_{l,k}, 1 ≤ k ≤ m − 1)
    if 1 ≤ m ≤ n − 1 ∧ 1 ≤ m ≤ n − l ∧ l ≥ 2 then
        LINKS P_{l+1,m−1}, P_{l−1,m+1}
            (PASSES A_{l−k,m+k}, 1 ≤ k ≤ m − 1)


This  declaration  means  all  of  the  following: 


• A family of processors exists. The family name is P. Each member of the family is
named by two indices, and any member exists if 1 ≤ m ≤ n ∧ 1 ≤ l ≤ n − m + 1.
The value n is an externally defined constant value (for any instance of the problem)
defining the problem size. This PROCESSORS declaration actually declares some
facts about every processor in the family.

• Each element, P_{l,m}, of this family is responsible for computing the value of (i.e.,
HAS) A_{l,m}. A is an array declared elsewhere in the specification that contains the
PROCESSORS declaration.

• If P_{l,1} is defined it needs v_l to compute its HAS values, and it expects to get these
values from (i.e., HEARS) the (only) processor in the Q family.

• If P_{l,m} is defined and 2 ≤ m ≤ n, then P_{l,m} needs the values of A_{l,k} for any k,
1 ≤ k ≤ m − 1. It also needs A_{l+k,m−k} for any k in that range. It expects to get these
values from processors in the P family, namely P_{l,m−1} and P_{l+1,m−1}. The scope of the
bound variables list (in this case, “l,m”) is the entire PROCESSORS declaration.

• Similarly, processors whose indices meet certain conditions TALK to other processors
and SEND certain values as specified by the enumerated expressions. Processors
are also declared as LINKing pairs of other processors and PASSing sets of values.


The TALKS/SENDS and LINKS/PASSES information is redundant; this information
can be inferred from the HEARS/USES data. In what follows, I will omit this
redundant information to enhance readability except where I judge it to be critical to an
understanding of the declaration.
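
The redundancy can be made concrete with a short Python sketch that enumerates the processor family and recovers the TALKS relation by inverting HEARS. The function names family, hears, uses and talks are hypothetical, and the sketch models only the index arithmetic of the declaration, not TRANSCONS itself; the I/O processors Q and R are left out.

```python
def family(n):
    """All index pairs (l, m) with 1 <= m <= n and 1 <= l <= n - m + 1."""
    return [(l, m) for m in range(1, n + 1) for l in range(1, n - m + 2)]


def hears(l, m):
    """Members of the P family that P_{l,m} HEARS (none when m = 1)."""
    return [(l, m - 1), (l + 1, m - 1)] if m >= 2 else []


def uses(l, m):
    """Index pairs of the A-values P_{l,m} USES, grouped by incoming wire."""
    left = [(l, k) for k in range(1, m)]            # A_{l,k}
    right = [(l + k, m - k) for k in range(1, m)]   # A_{l+k,m-k}
    return left, right


def talks(n):
    """Invert HEARS: talks(n)[p] lists the processors that hear p."""
    out = {p: [] for p in family(n)}
    for (l, m) in family(n):
        for src in hears(l, m):
            out[src].append((l, m))
    return out
```

For n = 4, talks(4)[(2, 2)] contains exactly the two processors one row further down the triangle, and talks(4)[(1, 4)] is empty because no member of the P family hears P_{1,4}.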


3.1.1.1 Rule MAKE-PSs: Give Each Non-I/O Array Element its Own Processor


By our conventions, the portion before the separator is the antecedent and the rest is the
consequent. Variables free in the antecedent are implicitly existentially quantified and the
scope of this quantification is the entire rule. Variables free only in the consequent are
universally quantified (but this is rare). A rule is said to apply if the antecedent is true;
when this happens the semantics of the rule is to make the consequent true. It is explicitly
permissible for the consequent to make the antecedent no longer true.


rule MAKE-PSs (**) TRANSFORM
    ** = 'bind NAME istype X'
  ∧ X = 'ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1'
  ∧ undefined (IO X)
  ∧ Y = (gensym 'PROC)
  ∧ Z = 'PROCESSORS (BOUND) ENUMERS HAS NAME_BOUND'

    ** = 'bind ... Y istype Z'

MAKE-PSs applied to Figure 3.1 binds as follows:

bindings:

    **      = ⟨⟨entire specification⟩⟩
    X       = 'ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1'
    NAME    = 'A'
    BOUND   = 'l,m'
    ENUMERS = '1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1'
    Y       = 'P'
    Z       = 'PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
               HAS A_{l,m}'


obtaining

  A istype ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
| P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1 HAS A_{l,m}
  v istype INPUT ARRAY (l), 1 ≤ l ≤ n
  O istype OUTPUT ARRAY
  ∀ l ∈ {1...n}
      A_{l,1} = v_l
  ∀ m ∈ {2...n}
      ∀ l ∈ {1...n − m + 1}
          A_{l,m} = ⊕_{k ∈ {1...m−1}} F(A_{l,k}, A_{l+k,m−k})


as  the  new  state  of  the  database. 


3.1.1.2 Rule MAKE-IOPSs: Assign I/O Arrays to Processors


This rule assigns a single processor to each input or output array. The reason only a
single processor is assigned is that it is assumed that input values will reside in a single
entity, such as a tape drive.


rule MAKE-IOPSs (**) TRANSFORM
    ** = 'bind NAME istype X'
  ∧ X = 'ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1'
  ∧ defined (IO X)
  ∧ Y = (gensym 'PROC)
  ∧ Z = 'PROCESSORS HAS NAME_BOUND'

    ** = 'bind ... Y istype Z'

Rules MAKE-PSs and MAKE-IOPSs make PROCESSORS declarations that
do not have USES and HEARS clauses yet. The next rule fills in those clauses, and
subsequent rules improve them.


Rule MAKE-IOPSs applies for two sets of bindings:


    **      = ⟨⟨entire specification⟩⟩
    X       = 'OUTPUT ARRAY O'
    IO      = 'OUTPUT
    NAME    = O
    BOUND   = ⟨empty binding list⟩
    ENUMERS = ⟨empty binding list⟩
    Y       = R
    Z       = 'PROCESSORS R
               HAS O'


    **      = ⟨⟨entire specification⟩⟩
    X       = 'INPUT ARRAY v_l, 1 ≤ l ≤ n'
    IO      = 'INPUT
    NAME    = v
    BOUND   = 'l'
    ENUMERS = '1 ≤ l ≤ n'
    Y       = Q
    Z       = 'PROCESSORS Q
               HAS v_l, 1 ≤ l ≤ n'


resulting in

(P.1)     A istype ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
          P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
              HAS A_{l,m}
          v istype INPUT ARRAY (l), 1 ≤ l ≤ n
|         Q istype PROCESSORS HAS v_l, 1 ≤ l ≤ n
          O istype OUTPUT ARRAY
          R istype PROCESSORS HAS O

          ∀ l ∈ (1...n)                                                   Θ(1)
(P.1a)        A_{l,1} = v_l                                               Θ(n)
          ∀ m ∈ (2...n)                                                   Θ(1)
              ∀ l ∈ {1...n − m + 1}                                       Θ(n)
(P.1b)            A_{l,m} = ⊕_{k ∈ {1...m−1}} F(A_{l,k}, A_{l+k,m−k})     Θ(n³)
(P.1c)    O = A_{1,n}                                                     Θ(1)

So  far,  all  rule  application  can  be  done  in  a  straightforward  manner,  without  inference. 


3.1.1.3 Rule MAKE-USES-HEARS: Determine Processors' Inputs


We need rules to describe the connections between processors and the data that
processors need to produce results. This rule is very conservative: it determines what array
values each processor P* needs, and it specifies a direct connection from the processors
holding those values to P*. The USES clause describes the values that a processor needs;
the HEARS clause describes the processors that have (HAS) these values.

To  determine  this,  consider  the  innermost  loop  which  assigns  values  to  array  elements 
indexed  by  non-region-constants.  Note  that  the  form  of  the  rule  shown  below  evidences  a 
need  for  elaborate  flow  analysis.  Non-constant  array  index  expressions  are  used  as  processor 
indices.  The  indices  for  those  array  elements  whose  values  can  affect  the  assigned  value 
comprise  the  index  expressions  for  the  USES  and  HEARS  sets.  A  reference  at  the  same 
loop  level  will  normally  generate  USES  and  HEARS  clauses  with  null  enumerations.  A 
reference  contained  in  a  deeper  loop  will  normally  generate  instances  of  such  clauses  with 
inherited  enumerators  from  the  loops. 


rule MAKE-USES-HEARS (**) TRANSFORM
    CB = 'bind' **
  ∧ ** = 'PDCL istype PROCESSORS (PBV) PENUMER HAS ANAME_BINDEX'
  ∧ X = (INNER-LOOP-THAT-DEFINES ANAME CB)
  ∧ Y ∈ (ARRAY-REFERENCES-AFFECTING X)
  ∧ Z = (EFFECTIVE-ENUMERATOR-OF Y X)
  ∧ W.CONDITIONS = CB.CONDITIONS ∪ (INFERRED-CONDITIONS X)
  ∧ W.CLASS = USES-CLAUSE
  ∧ W.ARG = 'ANAME
               (REL-BV PBV X.DEF-OF.INDEX-EXPR Y Z)
               (RELENUMER PBV X.DEF-OF.INDEX-EXPR Y Z)'
  ∧ Q.CONDITIONS = CB.CONDITIONS ∪ (INFERRED-CONDITIONS X)
  ∧ Q.CLASS = HEARS-CLAUSE
  ∧ HISBV = ANAME.PROCSTMT.PROC-BV-OF
  ∧ Q.ARG = 'ANAME.PROC-OF
               (REL-BV HISBV X.DEF-OF.INDEX-EXPR Y Z)
               (RELENUMER HISBV X.DEF-OF.INDEX-EXPR Y Z)'

    W ∈ **.clauses
  ∧ Q ∈ **.clauses


The INNER-LOOP-THAT-DEFINES function finds an innermost locality where
an element from the argument array is defined (not merely used). The
ARRAY-REFERENCES-AFFECTING function returns a set of all points in the program where
an array is referenced and the value returned can affect the results of its operand, a
program point. The EFFECTIVE-ENUMERATOR-OF function determines what (possibly
implicit) enumerators its first argument (an array reference) is controlled by, beyond
the enumerators that control its second argument (an array definition in this case).


The map, z.CONDITIONS, allows any node z to be placed under the influence of
conditions (an if clause). INFERRED-CONDITIONS is a function that produces an if
clause that specifies exactly those conditions that must be true for the point representing
the argument to be reached (a form of assertion propagation).

REL-BV and RELENUMER give a piece of text that respectively will serve as a
bound variable and an enumerator for the fragment enumerated by the fourth argument
to be valid for the third argument in the context of the second argument, using the bound
variables of the first argument. This would be the bound variables of the fourth argument
unless there is a variable name clash.




This  modifies  the  first  PROCESSORS  type  declaration,  which  becomes 


  P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
      HAS A_{l,m}
|     if m = 1 then HEARS Q (USES v_l)


Application to the assignment to A_{l,m} in (P.1b) produces


  P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n − m + 1
      HAS A_{l,m}
      if m = 1 then HEARS Q (USES v_l)
|     if 2 ≤ m ≤ n then
|         HEARS P_{l,k}, 1 ≤ k < m (USES A_{l,k}, 1 ≤ k < m)
|         HEARS P_{l+k,m−k}, 1 ≤ k < m (USES A_{l+k,m−k}, 1 ≤ k < m)


Finally, apply MAKE-USES-HEARS one last time, to the null “enumeration”,
(P.1c), that sends the output value to the output “array”, O. This forces us to modify R's
PROCESSORS declaration as follows:


  R istype PROCESSORS HAS O
|     HEARS P_{1,n} (USES A_{1,n})

This declaration is in its final form.

The applications of MAKE-USES-HEARS require flow analysis and some ability
to reason about enumeration (to construct if clauses).

3.1.2  Optimization  Rules 

The rest of the rules described in this section will transform the simplest parallel
structures into more efficient ones. They do this by detecting and removing redundant
interconnections.

3.1.2.1 Rule REDUCE-HEARS: Improve HEARS clauses

It may be that a HEARS clause of a PROCESSORS declaration requires each
processor to be connected to an asymptotically growing number of other processors. This
is undesirable, because the number of interconnections in the whole collection of processors
would grow faster than the number of processors, and the cost of interconnections would
exceed the cost of processors for sufficiently large problems. This would, in turn, decrease
the size of the largest problem that could be handled by a given parallel structure.

However, often it is not necessary for each processor to be connected to all other
processors whose values it needs. If processor P_a needs values from processors P_b and P_c,
but P_b needs a value from processor P_c, it may not be necessary for P_a to be connected to
P_c. P_a must be connected to P_b, but P_b will be able to get the value that P_a wants from
P_c, so it (P_b) can pass that datum along.

This form of this observation only secures a constant factor reduction in the number of
interconnections (in this case, from two to one), but it is possible to do better by extending
the principle. Suppose, for example, that a structure includes a family of processors P_i for
1 ≤ i ≤ n. Further suppose that ∀ i,j where j < i, P_i needs values from P_j. In this case,
P_{i+1} will need all the values P_i needs, plus the value in P_i itself.

Basic Observation 3.1 P_i is capable of supplying all of the information that P_{i+1}
needs, so it is possible to modify the structure to replace the Θ(n) connections required by
this HEARS clause by a single connection.

Definition 3.2 In a parallel structure, a family of processors is the set of processors
defined by a single PROCESSORS declaration when enumerated over the PROCESSORS
clause's enumerator. That family is generated by that PROCESSORS declaration.

Definition 3.3 The set of processors in a processor P_a's family HEARd by P_a due to a
HEARS clause H_0 will be written H_0(P_a), and is called the induced set of H_0 at P_a.

Definition 3.4 Consider H_0(P_a) and H_0(P_b). Suppose that each is a subset of the same
family as P_a and P_b (which are in the same family because they both have the same HEARS
clause, H_0). The interconnections defined by H_0 telescope if these sets H_0(P_a) and
H_0(P_b) either are disjoint or one strictly contains the other, for any choice of P_a and P_b in
the family. We also say that H_0 telescopes. If ∀ P_a, P_b ∈ family : (∅ ⊂ H_0(P_a) ⊂ H_0(P_b) ⇒
∃ P_c ∈ family : [H_0(P_a) ∪ {P_a} = H_0(P_c)]) then H_0 snowballs. The notion of a USES clause
telescoping is defined similarly. A partition is induced by a telescoping clause c_0 if two
processors are in the same partition whenever the sets defined by c_0 overlap.

Theorem  3.4  If  a  HEARS  clause  Ho  snowballs,  it  can  be  replaced  by  another  HEARS 
clause  that  only  specifies  input  from  a  single  processor. 
Proof: Consider the family of processors described by the PROCESSORS declaration
that contains the HEARS clause. Consider also the induced partition Π. If the cardinality
of an equivalence class E ∈ Π is (say) e, then ∀ P_x ∈ E : ‖H_0(P_x)‖ < e. (No processor can
HEAR itself because it would never be able to complete its calculation if it needed its own
result to do so.) Since ∀ x,y : x ≠ y ⇒ ‖H_0(P_x)‖ ≠ ‖H_0(P_y)‖, and since ‖{0...e − 1}‖ = e,
the processors in E can be completely ordered by the cardinalities of their HEARd sets. By
the basic observation and the snowballing property, each processor can get the information
that H_0 requires from the processor that is its predecessor in this ordering. |
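
The definitions and the construction in the proof can be exercised on the linear example family (P_i HEARS every P_j with j < i) using a small Python sketch. The names telescopes, snowballs and reduce_hears are hypothetical, induced sets are modelled as a dictionary from processor index to a Python set, and reduce_hears assumes for simplicity that the whole family forms a single equivalence class of the induced partition.

```python
def telescopes(H):
    """True if any two induced sets are disjoint or one strictly contains the other."""
    sets = list(H.values())
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if a & b and not (a < b or b < a):
                return False
    return True


def snowballs(H):
    """Whenever {} < H[a] < H[b], some c must satisfy H[a] | {a} == H[c]."""
    for a in H:
        for b in H:
            if H[a] and H[a] < H[b]:
                if not any(H[a] | {a} == H[c] for c in H):
                    return False
    return True


def reduce_hears(H):
    """Replace each nonempty induced set by the single predecessor in the
    ordering by cardinality (assumes one equivalence class)."""
    order = sorted(H, key=lambda p: len(H[p]))
    reduced = {}
    for pos, p in enumerate(order):
        reduced[p] = {order[pos - 1]} if H[p] else set()
    return reduced


# The linear family from the text: P_i HEARS every P_j with j < i.
H = {i: set(range(1, i)) for i in range(1, 6)}
REDUCED = reduce_hears(H)
```

Here REDUCED connects each P_i only to P_{i−1}, so the number of connections in the family drops from quadratic to linear in its size, which is the effect the theorem describes.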


Definition 3.5 We call replacing a HEARS clause, as in the previous theorem, reducing
the clause.


The  following  is  proven  in  Appendix  Section  B. 


Theorem  3.5  Reducing  a  snowballing  HEARS  clause  will  produce  a  parallel  structure 
whose  asymptotic  speed  is  the  same  as  the  speed  of  the  original  structure. 


We can now state this rule in English as follows: “If a HEARS clause snowballs then
reduce it”, and more formally as follows:

rule REDUCE-HEARS (stmt) TRANSFORM
    stmt = 'PNAME istype PROCESSORS ($PDV) $PENUMER ...
                if COND1 then
                    HEARS PNAME_$HBV $HENUMER
                        (USES UV_$UBV $UEN, ...) ...'
  ∧ (theorem
        (IS1 = {HBV : HENUMER[PDV \ PDV1]}
       ∧ IS1a = {HBV : HENUMER[PDV \ PDV2]}
       ∧ IS2 = {PDV : HENUMER ∧ HBV = PDV}
       ∧ PROC1 = {PDV1}
       ∧ PROC2 = {PDV}
       ∧ PROCh = {HEXPR}
       ∧ PROChi = {HIEXPR}
       ∧ (IS1 ∩ IS1a) ∈ {∅ IS1 IS1a}
       ∧ ((∅ ⊂ IS1 ⊂ IS1a ∧ COND1)
             ⇒ IS1 ∪ PROC1 = IS2)
       ∧ (COND2 ⇔ COND1 ∧ IS1 ∪ PROCh = IS2)
       ∧ (¬COND2 ⇒ ∃ PDV3 [IS1 ⊂ {HBV :
             HENUMER[PDV \ PDV3]}])
       ∧ (COND3 ⇔ COND2 ∧ COND1[PDV \ HIEXPR])
       ∧ (HIEXPR[PDV \ HEXPR] = PDV)))

    stmt = 'PNAME istype PROCESSORS ($PDV) $PENUMER ...
                if COND3 then
                    LINKS PNAME_$HEXPR, PNAME_$HIEXPR
                        (PASSES UV_$UBV $UEN, ...)
                if COND2 then
                    HEARS PNAME_HEXPR ...'


When this rule is applied to the current state, the bindings will be as follows:

stmt = 'P istype PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n - m + 1
          ... HEARS ...'
PNAME = 'P'
PDV = 'l,m'
PENUMER = '1 ≤ m ≤ n, 1 ≤ l ≤ n - m + 1'
HBV = 'l + k, m - k'
HENUMER = '1 ≤ k ≤ m - 1'
SET1 = {(l_1 + k, m_1 - k) : 1 ≤ k ≤ m_1 - 1}
SET1a = {(l_2 + k, m_2 - k) : 1 ≤ k ≤ m_2 - 1}
PROC1 = {(l_1, m_1)}
SET2 = {(l + k, m - k) : 1 ≤ k ≤ m - 1}
PROC2 = {(l, m)}
PROCh = {(l + 1, m - 1)}
HEXPR = '(l + 1, m - 1)'
COND1 = '2 ≤ m ≤ n'
COND2 = true


theorem  is  a  function  whose  argument  is  a  symbolic  set-theoretic  expression 
whose  atomic  terms  are  set  expressions.  These  expressions  are  principally  created  by  the 
BOUNDS  Y  function,  whose  inputs  are  the  bound  variables  list  of  the  processor  name  id, 
an  identity  parameter,  the  form  that  defines  the  array  references  that  comprise  the  array 
definition,  and  the  enumerator  (if  any)  for  the  array  reference. 

This  rule  reduces  the  HEARS  clauses  from  the  large  PROCESSORS  declaration 
of  the  current  state  to 


    HEARS P_{l,m-1}
    HEARS P_{l+1,m-1}


The  resulting  PROCESSORS  declaration  is 


P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n - m + 1
    HAS A_{l,m}
    if m = 1 then HEARS Q (USES v_l)
    if 2 ≤ m ≤ n then
        HEARS P_{l,m-1} (USES A_{l,k}, 1 ≤ k < m)
        HEARS P_{l+1,m-1} (USES A_{l+k,m-k}, 1 ≤ k < m)


Figure  3.3:  Final  Form  of  Main  Processors  Declaration  in  P-time  Dynamic  Programming 
Derivation 
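The snowballing condition that licenses this reduction can be checked mechanically on a small instance. The sketch below is only illustrative: it assumes the two HEARd index sets are {(l+k, m-k)} (from the HBV binding above) and {(l, k)} for 1 ≤ k ≤ m-1, and the function names are hypothetical rather than part of the system described here.

```python
def heard_diag(l, m):
    """Processors named by HBV = (l + k, m - k) for 1 <= k <= m - 1."""
    return {(l + k, m - k) for k in range(1, m)}


def heard_row(l, m):
    """Assumed companion clause: processors (l, k) for 1 <= k <= m - 1."""
    return {(l, k) for k in range(1, m)}


def snowballs(n, heard, predecessor):
    """True if every heard set equals the predecessor plus what it heard."""
    for m in range(2, n + 1):
        for l in range(1, n - m + 2):
            p = predecessor(l, m)
            if heard(l, m) != {p} | heard(*p):
                return False
    return True


n = 6
# Reduced clauses: HEARS P_{l+1,m-1} and HEARS P_{l,m-1}.
diag_ok = snowballs(n, heard_diag, lambda l, m: (l + 1, m - 1))
row_ok = snowballs(n, heard_row, lambda l, m: (l, m - 1))
```

When both checks succeed, each processor needs a direct connection only to its predecessor, which is what the reduced declaration expresses.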


3.1.2.2 Rule AS: Write the Individual Processors' Programs

The  general  idea  of  the  rule  is  that  the  first  rule  isolated  the  deepest  enumeration  in 
the  specification  which  assigned  a  value  to  an  array  element,  and  built  the  beginnings  of  a 
parallel  structure  where  each  array  element  within  the  domain  of  that  enumeration  had  its 
own  private  processor.  Since  the  enumeration  in  time  has  been  replaced  by  an  enumeration 
in  space,  the  layers  of  enumeration  that  get  us  to  the  point  which  induces  the  creation  of 
the  first  parallel  structure  can  be  stripped  away. 

A  technical  note  is  that  the  enumerations  can  only  be  completely  discarded  when  there 
is  no  calculation  at  intermediate  levels.  If  there  is  such  calculation,  the  system  will  have 
to  add  it  to  the  appropriate  processors  when  it  strips  away  the  layers  of  enumeration 
that  include  such  calculation  as  well  as  the  deeper  enumeration.  This  does  not  make 
the  asymptotic  behavior  of  the  parallel  structure  any  slower  except  when  the  calculations 
include  enumerations.  When  this  is  the  case,  it  might  be  possible  to  respecify  the  problem 
to  have  separate  copies  of  the  array  enumerated  in  the  calculation  for  each  cell  of  the  target 
array.  This  would  require  an  array  whose  dimension  is  the  sum  of  the  dimensionalities  of 
the  two  arrays. 


This rule is as follows: "Supply each processor specified by a PROCESSORS declaration
with a copy of those enumerations from the original program that occurred within
the region that included the assignment to array elements that generated that PROCESSORS
declaration. The references to array elements are replaced by associative lookups
from the table of information that the processor has HEARd. The outer enumerations
are stripped from the program, and uses of the variables that were bound in these outer
enumerations are replaced by constants reflecting the processor's ID."
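A small sketch may make the effect of this rule concrete. It assumes, purely for illustration, that the dynamic programming recurrence combines A_{l,k} and A_{l+k,m-k} with `+` and reduces with `min`; the source does not fix these operations, and the function names are hypothetical. For simplicity the whole table is passed as the HEARd information.

```python
def sequential(v):
    """Original program: outer enumerations over m and l, inner one over k."""
    n = len(v)
    A = {(l, 1): v[l - 1] for l in range(1, n + 1)}
    for m in range(2, n + 1):              # outer enumeration (in time)
        for l in range(1, n - m + 2):      # outer enumeration (in time)
            A[(l, m)] = min(A[(l, k)] + A[(l + k, m - k)] for k in range(1, m))
    return A[(1, n)]


def processor_program(l, m, heard):
    """Program of P_{l,m}: l and m are constants, only the k loop survives.

    Array references become lookups in the table of HEARd values.
    """
    return min(heard[(l, k)] + heard[(l + k, m - k)] for k in range(1, m))


def run_structure(v):
    """Simulate the family of processors; the loop order is only a schedule."""
    n = len(v)
    table = {(l, 1): v[l - 1] for l in range(1, n + 1)}
    for m in range(2, n + 1):
        for l in range(1, n - m + 2):
            table[(l, m)] = processor_program(l, m, table)
    return table[(1, n)]
```

The two functions compute the same value; the difference is that in `processor_program` the indices l and m are fixed by the processor's identity rather than enumerated.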


The derivation of the P-time dynamic programming parallel structure is almost complete.
It remains only to reduce the depth of enumeration to the single level implicit in
the segment,


Rule  AS  does  this.  The  complete  parallel  structure  that  results  is  as  follows: 


A istype ARRAY (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n - m + 1
P istype PROCESSORS (l,m), 1 ≤ m ≤ n, 1 ≤ l ≤ n - m + 1
    HAS A_{l,m}
    if m = 1 then HEARS Q (USES v_l)
    if 2 ≤ m ≤ n then
        HEARS P_{l,m-1} (USES A_{l,k}, 1 ≤ k < m)
        HEARS P_{l+1,m-1} (USES A_{l+k,m-k}, 1 ≤ k < m)
V istype INPUT ARRAY (l), 1 ≤ l ≤ n
Q istype PROCESSORS HAS v_l, 1 ≤ l ≤ n
O istype OUTPUT ARRAY
R istype PROCESSORS HAS O
(in P_{l,1}, 1 ≤ l ≤ n):
    A_{l,1} ← v_l                                   Θ(1)
(in P_{l,m}, 2 ≤ m ≤ n, 1 ≤ l ≤ n - m + 1):
    A_{l,m} ← ⊕_{k ∈ {1...m-1}} ...                 Θ(n)
(in P_{1,n}):
    ...


3.1.2.3 Rules HEAR-BY-CHAIN, SENDS-BY-CHAIN: Improve Topology of Input/Output

We see that the rules described so far will produce a parallel structure in which every
processor is directly connected to the input and output processors when given a specification
of array multiplication. Only one I/O processor is created per I/O array, and for
many problems, including array multiplication, it is necessary to get some input or output
from/to every processor. (P-time dynamic programming is an exception, in which only
$\Theta(n)$ of the $\Theta(n^2)$ processors receive input values and the output is only a single value.)

We  therefore  conceived  another  rule  to  attempt  to  reduce  the  excessive  connectivity 
that  results  from  every  processor  needing  access  to  input  or  output. 

If  the  following  conditions  are  met: 


• the number of processors $n_i$ in a family that receives input from or sends output to
a given processor is asymptotically unacceptable, and

• there is a HEARS clause $H_0$ such that the number of processors that do not HEAR
any processor using that clause (if input) or that are not HEARd by any processor
using that clause (if output) is asymptotically less than $n_i$,


then the I/O HEARS clauses can be reduced so that only those processors at a source
(or terminus if output) of $H_0$ are directly connected to the I/O processor.
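The second condition can be illustrated by counting the sources of a candidate chain clause. The sketch below assumes an n × n family and a hypothetical chain clause "if m > 1 then HEARS PC_{l,m-1}" of the kind introduced later in this chapter; the helper names are ours.

```python
def sources(n, hears):
    """Processors of an n x n family that HEAR no one through the clause."""
    procs = [(l, m) for l in range(1, n + 1) for m in range(1, n + 1)]
    return [p for p in procs if hears(*p) is None]


def chain_along_m(l, m):
    """Hypothetical chain clause: if m > 1 then HEARS PC_{l,m-1}."""
    return (l, m - 1) if m > 1 else None


n = 8
direct_before = n * n                              # every processor HEARS the I/O processor
direct_after = len(sources(n, chain_along_m))      # only chain sources remain connected
```

For this clause the number of sources grows as n while the family grows as n squared, so the condition of the rule is met.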

Pictorially,  we  convert  structures  that  look  like  these: 


Figure  3.4:  Many  Processors  Use  or  Build  the  Same  Data 


Figure 3.5: Resulting Structure from Sharing I/O Connections


We  are  not,  however,  prepared  to  do  this  yet.  We  need  to  have  the  chains  of  processors 
required  by  this  rule  in  order  to  improve  the  connections  to  the  I/O  processors.  For  this 
we  must  introduce  another  definition  and  another  rule. 

3.1.2.4  Rule  MAKE-CHAIN:  Create  Interconnections  in  a  Family  to  Reduce 
I/O  Connectivity 

Rules HEAR-BY-CHAIN, etc. allow the reduction of connections from/to an I/O
processor where a set of interconnections already exists to solve the I/O-free portion of the
problem. In some problems, including array multiplication, no convenient set of interconnections
exists and one must be introduced solely to distribute I/O values. Fortunately,
the rule that would do this is fairly simple to state and is evidently implementable, given
the mechanisms already required for REDUCE-HEARS. First we extend Definition 3.7
of induced sets to USEd values (the original definition covers only HEARd processors,
but it extends in the obvious manner to USEd, SENT and PASSed variables as well as
TALKed-to and PASSed-to processors). We then define the notion of telescoping, which
heuristically describes a situation in which a set of processors can be split into subsets such
that the subsets share interest in a restricted portion of the I/O data.

Definition 3.6 A clause telescopes if its induced sets for two processors are either disjoint,
or one contains the other. Whenever a clause telescopes, that clause defines a partition,
the induced partition, where two processors are in the same partition element iff
the induced sets for the two processors have a non-empty intersection. (Without loss of
generality we consider only non-empty sets. Empty sets are not considered because they
impose no interconnection requirements.)

The  rule  is:  where  a  single  USES  clause  telescopes,  order  the  induced  partition  by  the 
processor  indices  and  interconnect  the  processors  in  each  partition  with  a  new  HEARS 
clause  where  each  processor  is  connected  (only)  to  its  immediate  predecessor  (if  any)  in 
this  ordering.  Place  the  USES  clause  within  the  new  HEARS  clause  instead  of  within 
the  old  one. 
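The telescoping test and the chain construction can be sketched as follows. This is a minimal illustration, not the system's implementation: the induced sets are supplied as a dictionary from processor index to the set of USEd values, and the names `telescopes` and `make_chain` are hypothetical.

```python
from itertools import combinations


def telescopes(induced):
    """True if every pair of induced sets is disjoint or nested."""
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(induced.values(), 2))


def make_chain(induced):
    """Return {processor: predecessor or None} for a telescoping clause.

    Processors are grouped when their induced sets intersect, each group is
    ordered by processor index, and each processor HEARS only its immediate
    predecessor in that ordering.
    """
    classes = []
    for p in sorted(induced):
        for cls in classes:
            if induced[p] & induced[cls[0]]:
                cls.append(p)
                break
        else:
            classes.append([p])
    hears = {}
    for cls in classes:
        for prev, cur in zip([None] + cls, cls):
            hears[cur] = prev
    return hears


# Example: PC_{l,m} USES A_{l,k}, 1 <= k <= n, so the induced set is row l of A.
n = 3
induced = {(l, m): frozenset((l, k) for k in range(1, n + 1))
           for l in range(1, n + 1) for m in range(1, n + 1)}
chain = make_chain(induced)
```

For this example the construction yields the clause "if m > 1 then HEARS PC_{l,m-1}", with the processors at m = 1 acting as the sources of the chains.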


3.2  A  Derivation  of  a  Fast,  Parallel  Matrix  Multiplication  Structure 

Many parallel structures for the matrix multiplication problem have been proposed
in the literature, probably because of the problem’s practical importance and its obvious
suitability for parallel processing. One of the prettiest parallel structures is described
in [KuL76]. Kung’s structure multiplies two $n \times n$ matrices in $\Theta(n)$ time using $\Theta(n^2)$
processors of constant size. (Kung makes the assumption that a solution that involves
$\Theta(n)$ processors in communication with the outside world is acceptable. This subsection
follows that assumption, which is obviously necessary to multiply $n \times n$ matrices in $\Theta(n)$
time.) There are sequential algorithms with sub-cubic execution time, but there are no
obvious mappings of these algorithms into parallel structures.

Some new techniques must be introduced to derive the systolic array of [KuL76]. This
will be the subject of the next Section. It is, however, possible to derive a different parallel
structure with linear execution time. We added rule MAKE-CHAIN with this derivation
in mind, but do not feel that MAKE-CHAIN is contrived or impractical.

Our parallel structure uses more processors than a systolic array on a restricted class
of matrices called ‘band matrices,’ in which all but a narrow diagonal band of the input
matrices (and therefore of the output matrices) contains zero values. It does, however, use
fewer processors on general matrices.

The  starting  point  of  this  derivation  is  a  specification  of  matrix  multiplication  (we  are 
assuming  square  matrices  to  simplify  the  discussion): 

A istype INPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
B istype INPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
C istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
D istype OUTPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n

∀ i ∈ {1...n}                                        Θ(1)
    ∀ j ∈ {1...n}                                    Θ(n)
        C_{ij} = Σ_{k ∈ {1...n}} A_{ik} B_{kj}       Θ(n²)
        D_{ij} = C_{ij}                              Θ(n²)


The use of arrays C and D seems redundant, but its purpose is technical: our rules
would not permit us to assign multiple processors to a single array if that array were an
INPUT or OUTPUT array. Duplicating all of the arrays in this manner, to avoid all
appearances of "prejudicing the case" of which array’s parallelism would be important,
would change the resulting parallel structure only by replacing each processor by a set of
three processors.


MAKE-PSa  and  MAKE-IOPSa  add  PROCESSORS  declarations, 

A istype INPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PA istype PROCESSORS HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
B istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
| PB istype PROCESSORS HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
C istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PC istype PROCESSORS (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
D istype OUTPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PD istype PROCESSORS HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n

∀ i ∈ {1...n}                                        Θ(1)
    ∀ j ∈ {1...n}                                    Θ(n)
        C_{ij} = Σ_{k ∈ {1...n}} A_{ik} B_{kj}       Θ(n²)
        D_{ij} = C_{ij}                              Θ(n²)


MAKE-USES-HEARS completes the rough form of these declarations.


A istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PA istype PROCESSORS HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
B istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PB istype PROCESSORS HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
C istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PC istype PROCESSORS (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    HEARS PA (USES A_{l,k}, 1 ≤ k ≤ n)
    HEARS PB (USES B_{k,m}, 1 ≤ k ≤ n)
D istype OUTPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PD istype PROCESSORS HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
        (USES C_{l,m}, 1 ≤ m ≤ n)

∀ i ∈ {1...n}                                        Θ(1)
    ∀ j ∈ {1...n}                                    Θ(n)
        C_{ij} = Σ_{k ∈ {1...n}} A_{ik} B_{kj}       Θ(n²)
        D_{ij} = C_{ij}                              Θ(n²)


REDUCE-HEARS is unable to improve this parallel structure, because there are no
interconnections among the PCs to improve. Rule HEAR-BY-CHAIN is also helpless,
although the topology of the interconnection graph is too rich ($\Theta(n^2)$ rather than the goal
of $\Theta(n)$). Rule MAKE-CHAIN comes to the rescue. Adding the HEARS clauses
allowed by MAKE-CHAIN and by the USES clauses of PC produces:

A istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PA istype PROCESSORS HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
B istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PB istype PROCESSORS HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
C istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PC istype PROCESSORS (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    HEARS PA (USES A_{l,k}, 1 ≤ k ≤ n)
    HEARS PB (USES B_{k,m}, 1 ≤ k ≤ n)
|   if m > 1 then HEARS PC_{l,m-1}
|   if l > 1 then HEARS PC_{l-1,m}
D istype OUTPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PD istype PROCESSORS HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
        (USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n)

∀ i ∈ {1...n}                                        Θ(1)
    ∀ j ∈ {1...n}                                    Θ(n)
        D_{ij} = C_{ij}                              Θ(n²)


Then rule HEAR-BY-CHAIN is applied twice, and rule SEND-BY-CHAIN
once, finishing the derivation.

A istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PA istype PROCESSORS HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
B istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PB istype PROCESSORS HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
C istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PC istype PROCESSORS (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
|   if m = 1 then HEARS PA (USES A_{l,k}, 1 ≤ k ≤ n)
|   if l = 1 then HEARS PB (USES B_{k,m}, 1 ≤ k ≤ n)
|   if m > 1 then HEARS PC_{l,m-1} (USES A_{l,k}, 1 ≤ k ≤ n)
|   if l > 1 then HEARS PC_{l-1,m} (USES B_{k,m}, 1 ≤ k ≤ n)
D istype OUTPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PD istype PROCESSORS HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
|   (in PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n):
        C_{l,m} ← Σ_{k ∈ {1...n}} A_{l,k} B_{k,m}    Θ(n)
    (in PD):
        D_{l,m} ← C_{l,m}                            Θ(1)
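The behavior of the derived structure can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions: indices are 0-based, a whole row of A or column of B is handed along a chain in one step, and only the input connections to PA and PB are counted. The function name is hypothetical.

```python
def multiply_by_chains(A, B):
    """Simulate the PC family: rows of A travel along m, columns of B along l."""
    n = len(A)
    io_links = 0          # direct connections from PC processors to PA or PB
    row = {}              # row[(l, m)]: the values A_{l,k} available at PC_{l,m}
    col = {}              # col[(l, m)]: the values B_{k,m} available at PC_{l,m}
    C = [[0] * n for _ in range(n)]
    for l in range(n):
        for m in range(n):
            if m == 0:
                row[(l, m)] = list(A[l])              # HEARS PA
                io_links += 1
            else:
                row[(l, m)] = row[(l, m - 1)]         # HEARS PC_{l,m-1}
            if l == 0:
                col[(l, m)] = [B[k][m] for k in range(n)]   # HEARS PB
                io_links += 1
            else:
                col[(l, m)] = col[(l - 1, m)]         # HEARS PC_{l-1,m}
            C[l][m] = sum(a * b for a, b in zip(row[(l, m)], col[(l, m)]))
    return C, io_links


product, links = multiply_by_chains([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this simulation only 2n of the n squared PC processors are connected to an input processor, which is the reduction in I/O connectivity that the chain rules were introduced to achieve.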

3.3  Virtualization  and  Aggregation 
3.3.1  An  Informal  Description 

In virtualization we select a variable (which might be a hidden variable such as the
accumulation variable in a reduction operation) that receives assignments in a loop, explicate
it if necessary, and provide it with enough indices to meet the single assignment
condition. If this is done as often as possible to a specification, the value of each variable
will depend on a constant amount of computation, independent of the problem size. Since
the size of the virtualized structure will not be independent of the problem size, and since
communication is such that the running time of the problem will be polynomial in that
size, a fully virtualized structure will make light use of each processor.

We  can  make  use  of  this  fact  either  by  pipelining  or  aggregation. 

Consider  the  enumeration: 

$$c_{ij} = \sum_{k \in \{1 \ldots n\}} a_{ik} b_{kj}$$

There is an enumeration, but only over values, not destinations. For this reason,
separate processors will not be generated for the steps of the enumeration. One
can, however, make a few changes to the specification in order to generate separate
processors for the steps of the enumeration. (We will see the need for separate processors below.)


Generate the following virtualization, creating the array c:

c istype ARRAY ({1,...,n},{1,...,n},{0,...,n})

∀ k ∈ (1,...,n) do
    c_{i,j,k} = c_{i,j,k-1} + a_{ik} b_{kj}

This structure represents several changes:

• First, it introduces a new dimension to the main array for each level of enumeration
performed to find a value for the old elements of the array.

• Second, the enumeration $k \in \{1 \ldots n\}$ is changed into the enumeration $k \in (1 \ldots n)$.
This is perfectly legitimate: the set enumeration does not forbid enumeration in a
specified order. When we consider automating this process, however, we should
remember that there are $n!$ ordered enumerations corresponding to a specific unordered
one of length $n$. The best orderings to try will probably include the arrival orderings
inferrable from HEARS and HAS clauses, and the ‘natural’ orderings, i.e.,
numerical order and inverse numerical order (where numbers are involved).

Of course, this only applies when the inner enumeration(s) enumerate over a set.
When the enumerand is already a sequence, this step is unnecessary.

• Third, an identity for the enumeration’s operation is selected. This can be artificial, a
special null value that is checked for.

• Fourth, an ordering for the enumeration is selected.

• Fifth, explicit code to create running totals is generated.
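The changes listed above can be sketched as a small program. This is an illustration only: it uses 0-based indices for i and j, takes 0 as the identity for addition, and the function name `virtualize` is ours.

```python
def virtualize(a, b, identity=0):
    """Build the virtualized array c[(i, j, k)] of running totals.

    c[(i, j, 0)] holds the identity, and c[(i, j, k)] folds in the k-th term
    of the ordered enumeration; every element is assigned exactly once.
    """
    n = len(a)
    c = {}
    for i in range(n):
        for j in range(n):
            c[(i, j, 0)] = identity
            for k in range(1, n + 1):          # ordered enumeration k in (1...n)
                c[(i, j, k)] = c[(i, j, k - 1)] + a[i][k - 1] * b[k - 1][j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = virtualize(a, b)
```

The old element c_{ij} is recovered as c[(i, j, n)], while each intermediate running total now has its own array element and could therefore be given its own processor.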


In aggregation, instead of multiplying the number of processors we group processors
into one. We require that each processor have insufficient work to occupy the time that
the parallel structure requires for solution. This can arise because the processors need to
wait for partial results from other processors.

When  this  is  the  case,  the  processors  can  be  collected  into  groups  that  don’t  share  any 
processing  times.  Each  group  is  then  replaced  by  a  single  processor  that  is  responsible  for 
all  of  the  values  computed  by  any  member  of  the  group. 

Aggregation  is  needed  to  get  efficient  parallel  structures  after  virtualization,  because 
virtualization  produces  specifications  that  would  be  transformed  into  structures  that  have  a 
constant  amount  of  work  per  processor.  This  is  much  less  than  the  amount  of  time  available 
to  the  system,  because  after  virtualization  there  must  be  chains  of  processors  whose  length 
is  linear  in  some  measure  of  the  size  of  the  problem.  Heuristically,  virtualization  makes 
too  many  processors  and  aggregation  is  necessary  to  undo  this. 

The  power  of  the  techniques  arises  from  their  ability  to  together  arrange  for  different 
parts  of  the  work  of  computing  a  single  element  of  the  answer  to  be  performed  in  different 
processors. 

3.3.2  Definitions  of  Virtualization  and  Aggregation 

Definition 3.7 A virtualization of a parallel structure is a new parallel structure that
results from

• adding a dimension to an array, say $A$, as follows: if $A_l$ is a defined
element of $A$, and the computation of $A_l$ is performed by enumerating $n$ elements of
some set or vector $S$ and performing a binary operation on a running total and each
element of $S$ as it is enumerated, then $A_{l,m}$ for $0 \le m \le n$ will be a defined element
of the new array;

• making the enumeration of $S$ an ordered one; and

• replacing the original enumeration/calculation with a calculation that explicitly folds
the $j$-th value of the ordered enumeration as performed for $A_l$ by operating on
$A_{l,j-1}$ and that element.

The  process  of  creating  a  virtualization  is  also  called  virtualization. 

Definition 3.8 An aggregation of a parallel structure is a new parallel structure that
results from partitioning the old set of processors of a family into equivalence classes, and
creating a processor for each equivalence class. A processor in the aggregation HEARs
another such processor if any processor in the first equivalence class HEARd any processor
in the second.

The  process  of  creating  an  aggregation  is  also  called  aggregation. 

There  are,  of  course,  an  intractable  number  of  possible  aggregations  according  to  this 
definition.  Only  simple  aggregations  are  worthy  of  consideration,  because  allowing  complex 
ones  would  lead  to  a  combinatorial  explosion;  also,  the  complex  ones  would  tend  either  to 
leave  too  many  interprocessor  connections  or  to  have  too  much  work  being  done  in  some 
of  the  processors. 

Suppose that the unaggregated family of processors is defined as

    P_{x_1,x_2,...,x_d}(enumers)

We will use $P_X$ to represent this below. We use a notation for aggregations that describes
sets of processors that are identified by the aggregation. As an example, suppose we have
an $m \times n$ array of processors with a family name P, and we want to create a processor
family with $n$ processors, one from each column. The one from column $i$ will be $Q_i$. The
notation is

Q  istype  AGGREGATION  (i),  1  <i  <n-l\ {P,-,- :  1  <  y  <  m  -  1} . . . 

(where  the  usual  HAS,  HEARS,  TALKS  follows). 

The aggregations we will consider for a family of processors organized as a $d$-dimensional
array whose bounds are the vectors $L$ and $U$ can be categorized as follows:
Consider the $d$-element vectors $I$ each of whose elements is either −1, 0 or 1 and some of
whose elements are non-zero. We will consider an aggregation that produces a new processor
for each set of processors $\{P_{X+jI} : L \preceq X + jI \preceq U\}$, where $L \preceq X$ has the usual
meaning of $\forall\, 1 \le j \le d\; [L_j \le X_j]$.
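The family of aggregations just described is small enough to enumerate directly. The following sketch, with hypothetical names `directions` and `aggregate`, lists the candidate direction vectors and groups the processors of a bounded array into the lines X + jI, each of which would be replaced by one processor.

```python
from itertools import product


def directions(d):
    """All d-element vectors over {-1, 0, 1} with at least one non-zero entry."""
    return [v for v in product((-1, 0, 1), repeat=d) if any(v)]


def aggregate(lower, upper, direction):
    """Group the processors in the box [lower, upper] into lines along direction.

    Each group is keyed by the first point of its line that lies inside the
    bounds, found by stepping backwards along the direction vector.
    """
    groups = {}
    ranges = [range(lo, hi + 1) for lo, hi in zip(lower, upper)]
    for x in product(*ranges):
        start = x
        while True:
            prev = tuple(s - step for s, step in zip(start, direction))
            if all(lo <= p <= hi for lo, p, hi in zip(lower, prev, upper)):
                start = prev
            else:
                break
        groups.setdefault(start, []).append(x)
    return groups


# A 3 x 3 family aggregated along (1, -1): processors on one anti-diagonal merge.
groups = aggregate((1, 1), (3, 3), (1, -1))
```

A vector and its negation induce the same partition, so in practice only half of the vectors returned by `directions` need to be tried.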

3.3.3  Systolic  Structure  Synthesis 

We now study distributional problems preferably implemented by a systolic array. For
a specific example, suppose we are evaluating

    ∀ i ∈ (1,...,m)       ; must be enumerated in order
      ∀ j ∈ {1,...,n}     ; may be enumerated in any order
        B_j = B_j + A_i

and suppose that $m$ and $n$ are of the same order of magnitude, or that all of the B-values
are in a single place and we choose not to distribute them. A systolic structure is preferable
to a tree.

Consider the structure below:


    ... <-- | B_1 | <-- | B_2 | <-- ... <-- | B_n | <-- (feed in m A-values)

Figure 3.6: Simple Parallel Structure for Broadcasting

in which each of the $m$ A-values is added to each of the $n$ B-values. Note that no source
of B-values is given; each processor must be connected to B’s I/O processor.
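The chain of Figure 3.6 can be simulated in a few lines. The sketch assumes a synchronous model in which each A-value advances one cell per tick and every cell adds the value currently passing through it; the function name and the tick model are our assumptions, not part of the specification.

```python
def systolic_broadcast(b, a):
    """Feed the A-values into the last cell of a chain of B cells.

    Returns the updated B-values and the number of ticks until the chain
    is empty again.
    """
    n = len(b)
    b = list(b)
    pipe = [None] * n            # pipe[j]: A-value currently held at cell j
    feed = list(a)
    ticks = 0
    while feed or any(x is not None for x in pipe):
        # Values enter at cell n-1 and move one cell toward cell 0 per tick.
        pipe = pipe[1:] + [feed.pop(0) if feed else None]
        for j, x in enumerate(pipe):
            if x is not None:
                b[j] += x
        ticks += 1
    return b, ticks


result, ticks = systolic_broadcast([0, 0, 0], [1, 2])
```

Each B cell ends up holding the sum of all A-values, and only the last cell of the chain is connected to the source of A-values.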

We explicate the $m$ partial sums, using virtualization. This creates a separate processor
for each step in the summation process for each of the B-values.

Figure 3.7: Virtualized Broadcast Structure

Then we modify that slightly to feed in A-values in only one place.
Figure 3.8: Virtualized Broadcast Structure with Chains for I/O


We then identify $P_{i,j}$ with $P_{i+k,j-k}$ for all appropriate $k$:


Figure 3.9: Aggregation of Virtualized Broadcast Structure (labels: feed in B vector, take out answers, feed in A vector)


This parallel structure is better than a structure synthesized directly from the specification
because it does not impose strenuous requirements on the I/O capabilities of the
system. The specification does not say how B-values get to the various $B_i$. If this were
exposed, we would see that the assumption is made either that the broadcast problem
was embedded in a larger problem that allows the data to already be there, or that all $n$
B-processors HEAR the I/O processor. The systolic array shown above allows the I/O
processors to be connected to only a single processor.
3.3.4 Use of Virtualization and Aggregation for Matrix Multiplication


Consider  the  case  of  synthesizing  parallel  structures  for  matrix  multiplication  that  work 
in  linear  time.  Application  of  the  techniques  previous  to  virtualization  and  aggregation 
produces  the  following  parallel  structure: 

A istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PA istype PROCESSORS HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
B istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PB istype PROCESSORS HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
C istype ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PC istype PROCESSORS (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    if m = 1 then HEARS PA (USES A_{l,k}, 1 ≤ k ≤ n)
    if l = 1 then HEARS PB (USES B_{k,m}, 1 ≤ k ≤ n)
    if m > 1 then HEARS PC_{l,m-1} (USES A_{l,k}, 1 ≤ k ≤ n)
    if l > 1 then HEARS PC_{l-1,m} (USES B_{k,m}, 1 ≤ k ≤ n)
D istype OUTPUT ARRAY (l,m), 1 ≤ l ≤ n, 1 ≤ m ≤ n
PD istype PROCESSORS HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
(in PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n):
    C_{l,m} ← Σ_{k ∈ {1...n}} A_{l,k} B_{k,m}        Θ(n)
(in PD):
    D_{l,m} ← C_{l,m}                                Θ(1)

The asymptotic behavior of this parallel structure seems to be the same as that for
Kung’s parallel structure [KuL76]. However, there can be an advantage of Kung’s parallel
structure over the simpler one. When multiplying “band matrices”, where
$j - i < k_{0,0} \lor j - i > k_{1,0} \Rightarrow A_{ij} = 0$ and $j - i < k_{0,1} \lor j - i > k_{1,1} \Rightarrow B_{ij} = 0$,
it is possible to use fewer processing elements. If $k_{1,0} - k_{0,0} + 1 = w_0$ and
$k_{1,1} - k_{0,1} + 1 = w_1$, then it can be
shown that only $(w_0 + w_1)n$ of the processors of our parallel structure can have non-zero
answers, and only that many processors have to be provided. With Kung’s parallel
structure, however, only $w_0 w_1$ processors have to be provided. The multiplication takes
$\Theta(n)$ time. (It is possible to use the $\Theta((w_0 + w_1)n)$ processors to multiply the band matrices
in $\Theta(w_0 + w_1)$ time, but this parallel structure cannot be synthesized automatically using
these techniques, and in any event the time/processors tradeoff offered by Kung’s parallel
structure may be desirable.)
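The processor counts quoted above can be checked on a small instance by counting which entries of the product can be non-zero. The sketch below uses 0-based indices and hypothetical helper names; the band limits are given as (lowest, highest) values of j - i.

```python
def band(n, lo, hi):
    """n x n 0/1 pattern with possible non-zeros where lo <= j - i <= hi."""
    return [[1 if lo <= j - i <= hi else 0 for j in range(n)] for i in range(n)]


def nonzero_products(n, band_a, band_b):
    """Count entries (i, j) of the product that can be non-zero."""
    pa, pb = band(n, *band_a), band(n, *band_b)
    return sum(1 for i in range(n) for j in range(n)
               if any(pa[i][k] and pb[k][j] for k in range(n)))


n, band_a, band_b = 12, (-1, 1), (0, 2)
w0 = band_a[1] - band_a[0] + 1
w1 = band_b[1] - band_b[0] + 1
simple = nonzero_products(n, band_a, band_b)   # processors needed by the simple structure
systolic = w0 * w1                             # processors needed by the systolic structure
```

For these values the simple structure needs one processor for each of the possibly non-zero product entries, a count bounded by $(w_0 + w_1)n$, whereas the systolic count $w_0 w_1$ does not grow with $n$.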

The virtualization process, alone, is not enough to synthesize Kung’s systolic arrays.
Notice that the number of processors in the parallel structure that results from the obvious
virtualization is $\Theta(n^3)$. Partial sums of product array elements reside in different processors
at different times. This feature makes some technique like virtualization necessary to
separate the computation of partial products, but processors have to be grouped to prevent
this processor count blowup. Another more difficult technique, aggregation, will reduce the
processor count to the target level.

Heuristically, aggregation is the grouping together of processors, each of which does a
small amount of work, into groups of processors, each represented by a single processor.
Each processor does all of the work that any processor in the original group did, but this
can still be done quickly because each of the processors in the original group had a small
amount of work to do, and no two processors had to do their work at overlapping times.

The reason why Kung’s parallel structure can multiply matrices in linear time using
constant space per processor is that he has performed a virtualization on the summation
of result array elements. He avoids the need for $\Theta(n^3)$ processors by a process called processor
aggregation. Each processor is responsible for computing $\Theta(n)$ elements of the virtual array.

Reasoning similar to that performed in the change-of-basis generator and theorem
prover will serve us well here. The target interconnection structure is
P istype PROCESSORS (l,m), -n ≤ m ≤ n, -n ≤ l ≤ n - m + 1
    HAS C_{i,j,k}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ n, i - j = l - m,
        k = min(n - l + 1, n + m + 1, n)
    if m < n then
        HEARS ... (USES A_{ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i - j = l)
    if -n < m < n then
        LINKS ... (PASSES A_{ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i - j = l)
    if l > -n then
        HEARS P_{l-1,m} (USES B_{ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i - j = m)
    if n > l > -n then
        LINKS P_{l-1,m}, P_{l+1,m} (PASSES B_{ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i - j = m)
    if l < n ∧ m > -n then
        HEARS ...
            (USES C_{ijk}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ n, i - j = l - m,
             k = min(n - l + 1, n + m + 1, n))
    if l > -n ∧ m < n then
        TALKS ...
            (SENDS C_{ijk}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ n, i - j = l - m,
             k = min(n - l + 1, n + m + 1, n))


which is Kung’s structure. This requires two changes of basis of input matrices ($i - j$
of both A and B, rather than either $i$ or $j$), and a change of basis for the C array, as well
as replacement of the summation of each C-array element over a set of integers with a
summation over a sequence of integers.


The  figures  below  illustrate  the  virtualization  and  aggregation  processes  as  they  apply 
to  an  n  =  3  instance  of  a  matrix  multiplication  problem. 


Figure 3.10 shows the basic topology of the matrix multiplication parallel structure;

Figure 3.10: Unvirtualized Structure


Figure 3.11 shows the topology after the virtualization has been performed over the
summation;

Figure 3.11: Virtualization (ports labeled C OUT, A IN and B IN)


Figure 3.12 shows the sets of processors that are to be collapsed into a single processor;

Figure 3.12: Aggregation to be performed


and Figure 3.13 shows the configuration that results, in a form that would be more
recognizable to the student of [KuL76].

Figure 3.13: Aggregated (Systolic) Structure


3.3.5  What  Virtualization  Can  and  Cannot  Accomplish 


An important measure of the cost of a parallel structure is the product of the number
of processors, the size of each one, and the amount of time the parallel structure takes to
do a calculation. We will call this the PST measure.


PST = $\Theta((w_0 + w_1)n^2)$ for the simpler parallel structure for matrix multiplication,
when applied to band matrices of widths $w_0$ and $w_1$. Virtualization and aggregation can
improve this to $\Theta(w_0 w_1 n)$ by reducing the number of processors while allowing the size of
the processors and the running time of the algorithm to remain the same.

It is possible to achieve PST = $O((w_0 + w_1)^2 n)$ by other means. This is equivalent
whenever $w_1 = \Theta(w_0)$. Divide the $n \times n$ array of potential processors into
$(w_0 + w_1) \times (w_0 + w_1)$ blocks and introduce input and output connections at the
appropriate edges of each such block. This is impossible to derive by techniques shown so far,
or reasonable extensions to them. It has the further disadvantage that the number of
connections to input and output processors is $O(n)$, while the same number is
$\Theta(w_0 w_1)$ for the systolic array parallel structure that results from virtualization
and aggregation. A complexity measure that took into account the connections to the I/O
processors would favor the systolic array structure even over the improved simple matrix
multiplication scheme.

It should be noted that the parallel structure resulting from partitioning the potential
processors has the same PST as systolic arrays, but P and T are different. Different
measures, such as $PST^2$, may make different parallel structures more desirable.
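The comparison above can be made concrete with a small calculation. The following is only a sketch: it assumes unit-size processors, a running time proportional to $n$ for all three structures, and processor counts of $(w_0 + w_1)n$, $w_0 w_1$ and $(w_0 + w_1)^2$ respectively. The function names and the dropped constant factors are illustrative assumptions, not part of the thesis.

```python
def pst(processors, size, time):
    """PST measure: number of processors x size of each x running time."""
    return processors * size * time


def compare_pst(n, w0, w1):
    """Order-of-magnitude PST for three structures on band matrices.

    Assumptions (not from the source): unit processor size, time ~ n,
    and all constant factors dropped.
    """
    return {
        # simpler structure: one processor per element in the result band
        "simple": pst((w0 + w1) * n, 1, n),
        # systolic structure after virtualization and aggregation
        "systolic": pst(w0 * w1, 1, n),
        # partitioned structure: one (w0 + w1) x (w0 + w1) block of processors
        "blocked": pst((w0 + w1) ** 2, 1, n),
    }


if __name__ == "__main__":
    for name, value in compare_pst(n=100, w0=3, w1=4).items():
        print(name, value)
```

Under these assumptions the systolic and blocked figures differ only by a constant factor when $w_0$ and $w_1$ are of the same order, which is the equivalence claimed in the text.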

3.4 User-Assisted Aggregation

We have an arbitrary set of aggregations we will consider in order to keep the size of
the algorithm search tree reasonable, but there may be cases where it would be advisable
to consider other aggregations. For this reason, when the system considers an aggregation
it codes this information without hidden information, allowing the user to insert
AGGREGATE declarations of his own. The reasons we chose to allow this, rather than
to attempt to make the aggregation finding machinery complete, are:


•  The  bounds  of  arrays  are  often  arbitrary. 


•  There  are  many  aggregations  available;  it  is  not  clear  which  are  useful. 


•  It  is  not  uncommon  for  two  logical  data  to  share  parts  of  an  array. 


•  It  is  possible  for  one  array  to  match  the  combination  of  two  others. 


For these reasons, TRANSCONS understands the AGGREGATION type, of the form

Name ISTYPE AGGREGATE (bound) iters = Pset_bound
    HAS Aset_bound
        (HEARS Pname2_F(bound) iters2_bound
            (USES Aset2_bound ...))
    HAS ...;

This statement is a parameterized statement. The iters is a predicate defining permissible
bindings of the variables in the list bound. It means that those processors in the set
Pset_bound (which is a set-valued expression) are aggregated (i.e., identified to form a
single processor named Name_bound) for each binding of the variables in bound that is
permitted by the predicates in iters. It is explicitly permitted that the set-valued
expression can include enumerated elements and explicit setformers.

The  HAS,  HEARS  and  USES  elements  are  analogous  to  those  of  a  PROCESSORS 
declaration  (see  above),  although  in  an  AGGREGATION  type  there  can  be  more  than 
one  HAS  clause.  HEARS  clauses  are  associated  with  specific  HAS  clauses. 

When  the  user  provides  such  assistance,  searching  a  potentially  enormous  set  of  possible 
interfamily  aggregations  is  avoided.  TRANSCONS’s  abilities  are  used  to  check  the  validity 
of  the  user-proposed  aggregation. 

The  following  consistency  checking  is  performed  on  AGGREGATION  declarations: 

(i)  formally  specified  conditions: 


• Pset is disjoint for all distinct settings of bound and for all settings of the respective
bounds for two AGGREGATION declarations.

• Name(⟨specific bound⟩) HAS ⟨array element⟩ iff $\exists P(F) \in$ Pset(⟨specific bound⟩)
that HAS ⟨array element⟩.

• if Pname2 = Name, then

$\forall\, bound$ s.t. $iters$, $bound_2 = F(bound)$, $bound \neq bound_2$:

$\exists P_S \in Pset_{bound},\ P_A \in Pset_{bound_2}$:

$P_S$ HEARS $P_A$ (USES $A$)

(meaning that the HEARS clauses of the AGGREGATION declaration are those
induced by the underlying processors);

(ii)  informally  specified  conditions: 

• The order of total amount of computation done by processors underlying a given
node does not exceed the length of the longest chain.

• It is not true that A HEARS B and B HEARS A for the same USES datum.
(But violation of this condition is likely to imply violation of others.)

It  is  important  to  note  that  a  user’s  aggregation  declarations  are  indistinguishable  (to 
the  system)  from  automatically  inserted  ones. 
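Two of the formally specified conditions lend themselves to a short illustration. The sketch below is not TRANSCONS code: it assumes that each aggregate has already been expanded into an explicit Python set of underlying processor identifiers, and the names `find_overlap` and `induced_has` are hypothetical.

```python
from itertools import combinations


def find_overlap(aggregates):
    """Check that the processor sets of distinct aggregates are disjoint.

    `aggregates` maps an aggregate name to the set of underlying processors.
    Returns None if all sets are pairwise disjoint, otherwise a tuple
    (name_a, name_b, shared_processors) for the first offending pair.
    """
    for (name_a, procs_a), (name_b, procs_b) in combinations(aggregates.items(), 2):
        shared = procs_a & procs_b
        if shared:
            return name_a, name_b, shared
    return None


def induced_has(aggregates, has):
    """An aggregate HAS an array element iff some underlying processor HAS it.

    `has` maps a processor identifier to the set of array elements it holds.
    """
    return {
        name: set().union(*(has.get(p, set()) for p in procs))
        for name, procs in aggregates.items()
    }


if __name__ == "__main__":
    aggregates = {"N0": {(0, 0), (1, 1)}, "N1": {(0, 1), (1, 2)}}
    has = {(0, 0): {"c00"}, (1, 1): {"c11"}, (0, 1): {"c01"}}
    print(find_overlap(aggregates))
    print(induced_has(aggregates, has))
```

The same checks apply whether a declaration was written by the user or inserted automatically, since the two are indistinguishable to the system.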
Chapter 4

Trees,  Closures  and  Divide  &  Conquer 
or,  LAMBDA:  The  Ultimate  Transceiver* 


4.1  Motivation 

Suppose information must flow from processor B to processor A, but there is a conceptual
advantage to viewing the problem as if information were flowing the other way. We
have two motivating situations in which this is the case. One is the handshake problem,
in which an intermediate processor in a pipelined chain of processors must be able to
declare its readiness to handle another datum after it has processed a first. The second is
problems requiring tree-structured collections of interconnected processors. We would like
to use divide & conquer (D&C) to synthesize these trees, but that technique is difficult to
apply if data conceptually flow both up and down the tree. It becomes easier if the flow
is conceptually one way. We claim that D&C is a powerful synthesis technique that can
produce a large class of tree structured architectures if problems can be rephrased in terms
of one-way data flow.

We want to bring about a structure in which information flowing from a processor A to
another processor B tells B what to do with other information computed in B but needed
by A. We then want to restrict our synthesis task to reasoning about, computing and
sending the datum from A to B. We need a new type of datum, the closure. Processor A
sends processor B a closure, and B can later use it to cause data to be sent back to A and
to be used properly. We explore a series of requirements that must be met for D&C to
be used to synthesize tree structures. We show that if these requirements are met we get
efficient tree parallel structures. We then describe closures, including some technical issues
that make it plausible that they are implementable in a reasonable computation model. This
allows us to show that the D&C requirements are not restrictive, provided we allow closures,
by exhibiting a constructive proof that there is a structure that meets our requirements
equivalent to any member of a broad class of structures that do not. We end this Chapter
by exhibiting the syntax we use to describe trees and some formal axioms about them.

* With apologies to Guy Steele [Ste77]


4.2  The  Divide  &  Conquer  Paradigm  and  Tree  Synthesis 


D&C is a widely used technique for the synthesis of single-processor programs, and one
feels that it should be a good technique for the synthesis of tree-shaped parallel structures.
Trouble often arises, however, when we try to use D&C for this purpose.


Consider what the D&C technique actually is. “To solve a ‘large’ problem instance,
break it into pieces, solve the problem for each of the pieces, and combine the solutions”.
This is a technique for generating $O(n)$ and $O(n \log n)$ time, single processor solutions to
a wide variety of problems. See, for example, [SmiSS], [Knu69] and [AHU74].


Intuition  would  lead  us  to  believe  that  D&C  is  useful  for  synthesizing  tree-structured 
parallel  structures,  because  the  structure  of  a  solution  closely  matches  the  structure  of  the 
desired  set  of  processors.  Three  classes  of  problems  arise,  however: 
• rootlock: When we try to combine solutions for subproblems, the amount of
information traveling either from one subproblem to another or from the subproblems
to the combination operator, or the amount of work necessary to combine, may be
asymptotically large in the problem size. A naively synthesized parallel structure
would have to perform all of this work in one processor, namely a “root” processor
that has responsibility for combining partial solutions into a solution to the whole
problem.

• sequentiality: In a variant of D&C, one solves one of the subproblems first, and
uses some function of the solution as a parameter to the process that takes place
elsewhere. It is clear that in this case no problem element can enter the computation
until all previous elements have been used. There is no concurrency.

• bidirectionality: Information might have to flow both up and down the tree to
obtain a solution. This situation can make formal description of a combination
operator for D&C hard. It might appear that this condition is intrinsic to D&C,
but that is not the case. The data could already be distributed among an array of
processors (or available to be so distributed) and the division step can manipulate
indices only.

It is possible to have bidirectionality without sequentiality, but not vice versa. Rootlock
is independent of the other two situations.

These  three  properties  of  D&C  solutions  to  specifications  are  impediments  to  easy 
synthesis  of  tree-structured  parallel  structures  for  these  specifications. 

A specification, three of whose natural D&C solutions have one of these features each,
is Prefix Summation. In this specification, we have a one dimensional vector $a$ of length
$n$, and we want to create a vector $A$ such that $\forall\, 1 \le i \le n\ [A_i = \sum_{j=1}^{i} a_j]$.
In what follows we use the words “left” and “right” as if the array were arranged in a row
with $a_1$ leftmost and $a_n$ rightmost.

One  solution  is  “to  perform  prefix  summation  on  a  non-trivial  vector,  divide  it  into  two 
halves,  perform  prefix  summation  on  each  half,  and  add  the  rightmost  element  of  the  left 
result  to  each  element  of  the  right  result”.  This  solution  has  bidirectionality. 

A second solution is to first define “augmented prefix summation with augend $z$” as
$\forall\, 1 \le i \le n\ [A_i = z + \sum_{j=1}^{i} a_j]$. We then say that to perform augmented
prefix summation with augend $z$ on a non-trivial vector $a_{l:u}$, divide it into two halves
$a_{l:m}$ and $a_{m+1:u}$, perform augmented prefix summation with $z$ on the left half, and
perform augmented prefix summation with $z + \sum_{j=l}^{m} a_j$ (the rightmost element of the
left result) on the right half. This is intrinsically sequential.

A third solution is similar to the first, except that the result vector is carried up the
tree as the value of the D&C step rather than having, as the goal, to develop the new
values at the leaves. This has rootlock, i.e., it is an $O(n)$ solution when implemented on
a tree-structured multiprocessor system in a natural manner as it requires funnelling the
entire result vector through the root.
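The three formulations of prefix summation can be sketched as ordinary single-processor recursions, which makes the difference in data flow easier to see. This is an illustrative sketch only: the function names, the 0-based indexing and the use of a shared output list to stand for "values at the leaves" are assumptions, not notation from the thesis.

```python
def prefix_bidirectional(out, l, u):
    """First solution: results develop in place at the leaves out[l..u].

    After both halves are solved, the rightmost element of the left result
    is pushed back down to every leaf of the right half (two-way flow).
    """
    if l == u:
        return
    m = (l + u) // 2
    prefix_bidirectional(out, l, m)
    prefix_bidirectional(out, m + 1, u)
    carry = out[m]
    for i in range(m + 1, u + 1):
        out[i] += carry


def prefix_sequential(a, out, l, u, z):
    """Second solution: augmented prefix summation with augend z.

    The right half cannot start until the left half has produced out[m].
    """
    if l == u:
        out[l] = z + a[l]
        return
    m = (l + u) // 2
    prefix_sequential(a, out, l, m, z)
    prefix_sequential(a, out, m + 1, u, out[m])  # out[m] == z + sum(a[l..m])


def prefix_rootlock(a, l, u):
    """Third solution: the whole result vector is returned up the tree."""
    if l == u:
        return [a[l]]
    m = (l + u) // 2
    left = prefix_rootlock(a, l, m)
    right = prefix_rootlock(a, m + 1, u)
    return left + [left[-1] + x for x in right]


if __name__ == "__main__":
    a = [3, 1, 4, 1, 5, 9, 2, 6]
    first = list(a)
    prefix_bidirectional(first, 0, len(a) - 1)
    second = [None] * len(a)
    prefix_sequential(a, second, 0, len(a) - 1, 0)
    print(first, second, prefix_rootlock(a, 0, len(a) - 1))
```

All three compute the same vector; they differ in whether data must travel back down (first), whether one half must wait for the other (second), or whether the full result passes through the root (third).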

In  the  remainder  of  this  Section  we  formalize  the  problems  described  above.  We  also 
show  that  if  a  D&C  formulation  has  none  of  these  problems  then  there  exist  fast  parallel 
structures  that  are  functionally  equivalent.  We  then  exhibit  the  syntactic  structures  we  use 
to  describe  trees.  We  then  describe  the  closure  concept  and  we  argue  that  use  of  closures 
makes  the  imposition  of  our  requirements  much  less  restrictive  than  otherwise  by  showing 
that  a  broad  class  of  structures  not  meeting  them  can  be  syntactically  transformed  into 
ones  that  do.  Finally  we  show  that  use  of  these  closures  causes  no  loss  of  computation 
speed  (within  a  constant  factor). 

In the following, we will describe programs generically using schemata, in the usual
sense. A scheme is a program fragment in which some operators are merely given names
rather than being described. If all of the names are given an interpretation a scheme
becomes a program or a scheme instance or an instance of the scheme. Below we define
sequentiality and bidirectionality and a notion, P-combination, of a combination operator
that makes the scheme suitable for an implementation in which solutions to the subproblems
are developed in parallel in separate processors, delivered to a third processor, and
combined by that processor.


In the schemata that follow, upper case letters with subscripts and superscripts are
parametrized functions of appropriate arity. The subscripts and superscripts are integers
representing a range of elements from the problem instance, and the actual problem
represented by a function in this notation depends on the values of the range parameters
and on the values from the problem instance in that range.


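Definition 4.1 A specification is P-combinational if it is an instance of this scheme:

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(V_l^{\lfloor (l+u)/2 \rfloor},\ V_{\lfloor (l+u)/2 \rfloor + 1}^{u}\right), & \text{otherwise} \end{cases}$$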


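A specification is sequential if it is not P-combinational and it is an instance of one
of these schemata (writing $m = \lfloor (l+u)/2 \rfloor$):

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(V_l^{m},\ H_{m+1}^{u}(V_l^{m})\right), & \text{otherwise} \end{cases}$$

or

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(H_l^{m}(V_{m+1}^{u}),\ V_{m+1}^{u}\right), & \text{otherwise} \end{cases}$$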

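A specification is bidirectional if it is not P-combinational or sequential and it is an
instance of one of these schemata (writing $m = \lfloor (l+u)/2 \rfloor$):

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(V_l^{m},\ H_{m+1}^{u}(V_l^{m}, V_{m+1}^{u})\right), & \text{otherwise} \end{cases}$$

or

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(H_l^{m}(V_l^{m}, V_{m+1}^{u}),\ V_{m+1}^{u}\right), & \text{otherwise} \end{cases}$$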

The important point of these definitions is that the result of the recursion for one of
the parts of the problem after the division step is used as an argument to the function that
computes the other part of the result ($H$). The treatment of half of the problem depends
on the solution of the other half.

There are some lemmas to prove before the main results of this Chapter. In what
follows, we will use $T(\langle\text{a value}\rangle)$ or $T(\langle\text{a function}\rangle)$ to denote the time
required for an optimal parallel structure to compute the value or the function.
$T(\langle\text{a processor}\rangle)$ is the time for that processor to generate its result (where the
meaning is obvious). In all cases where we use this notation, it will be obvious how the
evaluation is to take place. We also assume that there exists a monotonically increasing
function $F$ such that $T(G_l^u) \le F(u - l + 1)$.


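Lemma 4.1 If we have a D&C scheme instance of the following:

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(V_l^{\lfloor (l+u)/2 \rfloor},\ V_{\lfloor (l+u)/2 \rfloor + 1}^{u}\right), & \text{otherwise} \end{cases}$$

then $T(V_l^u) \le \max\left(T(V_l^{\lfloor (l+u)/2 \rfloor}),\ T(V_{\lceil (l+u+1)/2 \rceil}^{u})\right) + O(T(G))$.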
Proof: By the definition of $T(\ldots)$, there exist two parallel structures, one that computes
$V_l^{\lfloor (l+u)/2 \rfloor}$ in $T(V_l^{\lfloor (l+u)/2 \rfloor})$ and one that computes
$V_{\lceil (l+u+1)/2 \rceil}^{u}$ in $T(V_{\lceil (l+u+1)/2 \rceil}^{u})$.
Let the processor that develops $V_l^{\lfloor (l+u)/2 \rfloor}$ be called $P_l$ and that which
develops $V_{\lceil (l+u+1)/2 \rceil}^{u}$ be called $P_r$. Connect a new processor to each of
$P_l$ and $P_r$, and call it $P$.

$P_l$ and $P_r$ develop their results in $T(V_l^{\lfloor (l+u)/2 \rfloor})$ and
$T(V_{\lceil (l+u+1)/2 \rceil}^{u})$, respectively. The amount of time required to communicate
both of these results is

$\le O(F(\lceil (u - l + 1)/2 \rceil))$ (by monotonicity of $F$.)

This requires the observation that the time required to develop a result is at least
proportional to the result’s length.

Once the results are both delivered to $P$, it computes its result in $T(G)$, making the
total time for the computation

$$\begin{aligned}
&\max\left(T(V_l^{\lfloor (l+u)/2 \rfloor}),\ T(V_{\lceil (l+u+1)/2 \rceil}^{u})\right) && \text{(develop partials)} \\
&+\ O(F(\lceil (u - l + 1)/2 \rceil)) && \text{(by above observation)} \\
&+\ O(F(u - l + 1)) && \text{(by hypothesis)}
\end{aligned}$$

and the two last bounds coalesce by the fact that $F$ is monotonically increasing. ∎


We can now prove that the computation of the closure in the root node is fast:

Theorem 4.2 Suppose a problem fits a D&C scheme with P-combination. That is, that
the computation of the result in question for the substring of the problem ranging from $l$ to
$u$ is

$$V_l^u = \begin{cases} W_l, & \text{if } l = u \\ G\left(V_l^{\lfloor (l+u)/2 \rfloor},\ V_{\lfloor (l+u)/2 \rfloor + 1}^{u}\right), & \text{otherwise} \end{cases}$$

and $T(G)$ (the time to compute $G$) is $\le O(F(u - l + 1))$, where $F$ is a nondecreasing
function. Then $T(V_1^n) = O(F(n) \log n)$.

Proof: Note that the form of the definition of $V_l^u$ precludes sequentiality and
bidirectionality. We are using value semantics for the call to $G$. $T(V_l^l) = T(W)$, so $T(W)$
is bounded. Say $T(G) \le c_0 F(u - l + 1)$. We prove by induction that
$T(V_l^u) \le c_0 F(u - l + 1) \lg(u - l + 1) + T(W)$, where $c_0$ is the constant of
$T(V_1^n) = O(F(n) \log n)$.

The base case is immediate.

If $l \ne u$ then

$$\begin{aligned}
T(V_l^u) &= \max\left(T(V_l^{\lfloor (l+u)/2 \rfloor}),\ T(V_{\lceil (l+u+1)/2 \rceil}^{u})\right) + T(G) && \text{(definition, nonsequentiality)} \\
&\le c_0 F((u - l + 1)/2) \lg((u - l + 1)/2) + T(W) + T(G) && \text{(by induction)} \\
&\le c_0 F(u - l + 1) \lg(u - l + 1) + T(W) && \text{(monotonic } F\text{)}
\end{aligned}$$

This is $O(F(u - l + 1) \log(u - l + 1))$, which is $O(F(n) \log n)$ at $T(V_1^n)$. ∎


The proof requires that sequentiality and bidirectionality not be present. If we have
sequentiality then
$T(V_l^u) = \max\left(T(V_l^{\lfloor (l+u)/2 \rfloor}),\ T(V_{\lceil (l+u+1)/2 \rceil}^{u})\right) + T(G)$
does not hold because the computation of the $V$’s cannot proceed in parallel. If
bidirectionality is present then we must have sequentiality. The proof goes through even if
rootlock is present, but in such a case the theorem produces a weak result, since $F(n)$ would
be large.
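The recurrence behind Theorem 4.2 can be checked numerically. The sketch below evaluates the worst case $T(n) = T(\lceil n/2 \rceil) + c_0 F(n)$ and compares it with the bound $c_0 F(n) \lceil \lg n \rceil + T(W)$. The ceiling on the logarithm is an adjustment made here to cover uneven splits; the text works with $\lg$ directly. All function names and the sample choices of $F$ are assumptions for illustration.

```python
def ceil_lg(n):
    """Integer ceiling of log2(n) for n >= 1."""
    return (n - 1).bit_length()


def tree_time(n, F, c0=1, leaf_time=1):
    """Worst-case time to develop the result for an interval of n elements.

    The two halves are developed in parallel, so only the larger half
    counts; the combination step then costs c0 * F(n).
    """
    if n == 1:
        return leaf_time
    larger_half = (n + 1) // 2
    return tree_time(larger_half, F, c0, leaf_time) + c0 * F(n)


def claimed_bound(n, F, c0=1, leaf_time=1):
    """The bound c0 * F(n) * ceil(lg n) + leaf_time."""
    return c0 * F(n) * ceil_lg(n) + leaf_time


if __name__ == "__main__":
    for F in (lambda k: 1, lambda k: k):
        ok = all(tree_time(n, F) <= claimed_bound(n, F) for n in range(1, 1025))
        print(ok)
```

With a constant $F$ the time grows logarithmically; with $F(k) = k$ the bound still holds but is weak, which mirrors the remark about rootlock.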




We must now briefly describe closures, which are the objects we intend to build using
D&C. After we describe them, we show that use of closures does not produce slow parallel
structures.

4.3  Description  of  Closures 

Our solution to the problem of D&C formulations that do not meet our conditions of
lacking bidirectionality, sequentiality and rootlock is based on the idea of passing a form
of data called a closure up the tree, and therefore computing “big” closures (ones that
perform a service for a large interval in the original problem vector) from “small” ones.
A closure is a procedure or function definition together with an environment, i.e., a set
of name/value pairs. When a closure is invoked, the procedure or function is invoked in
the included environment as augmented by parameter binding. When processor A passes
processor B a closure, A is said to be the closure’s host and B the recipient.

A closure consists of a procedure, and bindings for some of the procedure’s free variables.
The procedure, in turn, consists of a piece of program and a binding list. The concept was
first described in Church’s λ-calculus [Chu51]. Closures are valued for their expressive
power even on single-processor algorithms. They are elements primarily of dialects of LISP.
See, for example, [Ste77], [Moo82], [XlR83]. A similar concept, actors, is also found in
other languages (see, for example, PLASMA in [SmH75]). Actors are also described as a
method of expressing interprocessor communications concepts [AHe77]. We here explore
a case in which our similar concept aids a computerized parallel structure synthesis task.

The notation $\lambda X[F(X, Y)]$ denotes an abstraction of a function of $n$ ($n \ge 0$)
parameters from a function of $n + m$ ($m \ge 0$) parameters. $X$ represents $n$ bound
variables; these variables are bound to the values of the $n$ actual parameters when the
function is called. $Y$ represents $m$ free variables, and the values used for them when
$F(X, Y)$ is evaluated are determined by some of the context in which the abstracted function
is evaluated.


We will use “$\lambda X^{Y}[F(X, Y, Z)]$”, where again $X$, $Y$ and $Z$ denote three groups of
(respectively) $n$, $m$, and $o$ variables, to denote a closure generating form (CGF). This is a
piece of program text that, when evaluated, makes a closure that can be applied to as many
parameters as there are elements of the $X$ group. $X$ is the group of bound variables and $Y$
is the group of variables whose current values will be stored in the closure. $Z$ is the group
of free variables whose values at application time are to be used.

When a closure is applied, the $X$-values from the actual parameters in the application,
the $Y$-values available at closure creation time and the $Z$-values at application time will be
used. The $Y$ group is constituted from the closed variables. We will use
$\lambda X^{Y=V}[F(X, Y, Z)]$, which would be created by the above CGF if $Y = V$ at the time the
CGF is evaluated, to denote the closure in which $y_1 = v_1, y_2 = v_2, \ldots, y_m = v_m$.

An example of a CGF, taken from the synthesis of a parallel prefix summation unit, is
$\lambda z^{V_l, V_r, C_l, C_r}[(C_l(z) \parallel C_r(z + V_l))]$. The semantics of this is that when
the form is evaluated its value is a closure, which is a functional object of one argument.
The values of $V_l$, $V_r$, $C_l$, and $C_r$ at the time the CGF is evaluated are “frozen” into
the closure, and when that is applied the values are used. We get the closure
$\lambda z^{V_l = a, V_r = b, C_l = f, C_r = g}[(C_l(z) \parallel C_r(z + V_l))]$
when we evaluate the CGF in an environment in which $V_l = a$, $V_r = b$, $C_l = f$, and
$C_r = g$. We see that $C_l$ and $C_r$ are, themselves, closures. The CGF calls for the creation
of a closure that, when applied, binds four variables (one of which, $V_r$, does not appear in
the form) to four saved values and $z$ to the argument. It then applies one of the saved
closures to $z$ and the other to $z + V_l$ simultaneously.
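The prefix summation CGF can be imitated with ordinary Python closures. In this sketch each tree node returns a pair: the sum of its interval (playing the role of $V$) and a closure (playing the role of $C$) that, given an augend $z$, fills in the results at its leaves. The names `build`, `make_leaf` and `prefix_sums` are hypothetical, and the two child applications are run one after the other here, although they are independent and could run concurrently.

```python
def make_leaf(a, out, i):
    """Leaf host: its closure writes the final value for position i."""
    def leaf_closure(z):
        out[i] = z + a[i]
    return a[i], leaf_closure


def build(a, out, l, u):
    """Upward phase: return (V, C) for the interval l..u."""
    if l == u:
        return make_leaf(a, out, l)
    m = (l + u) // 2
    v_l, c_l = build(a, out, l, m)
    v_r, c_r = build(a, out, m + 1, u)

    def closure(z):
        # Corresponds to (C_l(z) || C_r(z + V_l)); v_l, c_l and c_r are the
        # values "frozen" into the closure when it was generated.
        c_l(z)
        c_r(z + v_l)

    return v_l + v_r, closure


def prefix_sums(a):
    """Build the root closure once, then apply it with augend 0."""
    if not a:
        return []
    out = [None] * len(a)
    _, root_closure = build(a, out, 0, len(a) - 1)
    root_closure(0)
    return out


if __name__ == "__main__":
    print(prefix_sums([3, 1, 4, 1, 5, 9, 2, 6]))
```

Only single values and closures travel up the tree, and applying the root closure needs no result to come back, so data flow is conceptually one way.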

TRANSCONS will only generate a restricted subset of the possible closure generating
forms. They will be of the form $C = \lambda Z[G(Z, V_l, V_r, W)]$. $V_l$ (resp. $V_r$) are the
values, including closures, which we will call $C_l$ (resp. $C_r$), received from left (resp.
right) children. $W$ includes locally available values which may have been computed during
previous upward communication phases, and $G$ is of the form
$G(V_l, V_r)(Z) = (C_l(G_l(Z, V_l, V_r, W)) \parallel C_r(G_r(Z, V_l, V_r, W)))$.
(Here $\parallel$ is concurrent application, and $G_l$ and $G_r$ can impose side effects on $W$.)

We want to show that if the closure generating form has the property that a
generated closure can be applied by doing a small amount of computation and applying
other similar closures in turn, then application of the generated closure is fast. For
convenience we will call this type of closure generating form a tail applicable form (by
analogy with the term “tail recursion”).

Theorem 4.3 Suppose a closure is computed in the root of a balanced binary tree. That
closure can contain closures whose hosts are its children. Those closures, in turn, can
contain closures whose hosts are their children, etc. The leaves of the tree are closure hosts
whose closures can be applied in time $O(1)$. Each leaf has a distinct index such that the
set of indices of leaves is exactly a range of integers. The set of indices of every subtree’s
leaves is also a range of integers and is identified by the endpoints of that range, and the
node heading the subtree whose leaves’ indices are $l$ through $u$ is called $n_l^u$. Suppose all
closures computed within the tree are tail applicable. If, in $n_l^u$,
$\max(T(G), T(G_l), T(G_r)) = O(F(u - l + 1))$, then $T(C_1^n) = O(F(n) \log n)$.

Proof: Refer to $\max(T(G), T(G_l), T(G_r))$ in $n_l^u$ as $T(n_l^u)$. $T(n_l^l) = O(1)$, so call
it $k$. Say $T(G) \le c_0 F(u - l + 1)$. We prove by induction that
$T(n_l^u) \le c_0 F(u - l + 1) \lg(u - l + 1) + k$, where $c_0$ is the constant of
$T(n_1^n) = O(F(n) \log n)$.

The base case is immediate.

If $l \ne u$ then

$$\begin{aligned}
T(n_l^u) &= \max\left(T(n_l^{\lfloor (l+u)/2 \rfloor}),\ T(n_{\lceil (l+u+1)/2 \rceil}^{u})\right) + T(G) && \text{(definition, nonsequentiality, tree balance)} \\
&\le c_0 F((u - l + 1)/2) \lg((u - l + 1)/2) + k + T(G) && \text{(by induction)} \\
&\le c_0 F(u - l + 1) \lg(u - l + 1) + k && \text{(monotonic } F\text{)}
\end{aligned}$$

This is $O(F(u - l + 1) \log(u - l + 1))$, which is $O(F(n) \log n)$ at $T(n_1^n)$. ∎

Note the similarity between this proof and that of Theorem 4.2.

Our  technique  will  be  to  reformulate  the  problem  from  that  of  creating  some  new  array 
that  is  a  function  of  an  existing  array  to  that  of  creating  a  closure  which,  when  invoked,  will 
perform  a  given  action  on  the  leaves  of  a  tree.  This  action  is  the  creation  of  an  element  of 
the  new  array  in  each  leaf.  The  original  specification  is  transformed  into  a  specification  that 
declares  the  existence  of  a  closure  that,  when  invoked,  will  satisfy  the  original  specification, 
followed  by  a  specification  that  the  new  closure  be  invoked.  In  the  synthesis  process,  the 
specification  that  a  closure  with  certain  properties  exists  is  transformed  into  code  that 
creates  such  a  closure.  This  code  has,  as  input,  closures  having  the  desired  property  in 
relation  to  subproblems  of  the  problem. 

Consider the process of combining two closures. The result will be a closure. In D&C
with closures, it is normal to combine a pair of vectors, each containing computed values
and closures, into a new vector with similar texture. We will consider the computation of
a closure for the output vector, and the use of the resulting closures. We give informal
reasoning to justify our assertion that closure generation is an effective technique for
producing tree structures by D&C; we then prove formal versions of the informal assertions
we make.

The  combination  operator  can  operate  on  values  and  closures  from  the  input  vectors  to 
produce  a  new  closure.  The  code  before  the  closure  generating  form  (CGF)  may  compute 
values  that  will  be  included  in  the  closure,  and  the  CGF  itself  will  create  an  environment 
in  which  these  values,  as  well  as  other  values  and  some  closures  that  were  part  of  the 
input  vectors,  are  saved.  The  combine  operator  is  neither  bidirectional  nor  sequential  if  we 
have  reformulated  the  problem  properly,  and  it  avoids  rootlock  if  pairwise  combination  of 
results  from  vectors  produces  new  results  not  much  larger  than  those  that  were  combined, 
in  an  amount  of  time  not  much  longer  than  the  amount  of  time  used  to  develop  each 
of the old results. For our purposes we would want these amounts to increase at most
polylogarithmically in the length of the string incorporated in the results.

When the computation uses the computed closures, each host will have computed the
closure it hosts by computing and storing some values, storing received closures from its
children, and arranging that when the closure is applied, formal parameters (if any) be
bound to actual parameters, a computation be performed to obtain actual parameters for
its closures’ incorporated closures, and such closures, in turn, be applied. The computations
of a closure must meet similar conditions to those met by the combination process and
described above. It should be noted, however, that the incorporated closures can be applied
in parallel and that there is no need for the calling procedure to await completion of any
applications.

We have exchanged the difficulty of reasoning about two-way data flow for the need
to reason about closures. We feel that this is a good bargain because reasoning about
closures  only  requires  the  addition  of  new  axioms  to  a  theorem  prover’s  data  base,  while 
two-way  data  flow  requires  changes  in  the  way  we  look  at  D&C.  In  Section  4.2  we  showed 
that  this  change  of  view  costs  little  speed,  and  in  Section  4.5  we  show  that  no  expressive 
power  is  lost. 

We conjecture that this technique can bring most $O(\log n)$ and $O(\log^i n)$ tree parallel
structures within the reach of a D&C-based synthesis method. We support this conjecture
by several syntheses in the next Chapter. Since a tree-structured processor is inexpensive
to manufacture compared to more highly interconnected machines and seems to be
reasonably powerful, we feel that automatic tools that make use of this family of topologies
easier would be an important contribution to the technology of synthesis of parallel
structures.

In summary, the technique of computing closures from component closures is a technique
which, together with D&C, provides the ability to synthesize a wide variety of tree
structures with few of the technical problems that other synthesis methods might encounter
concerning reasoning about path lengths or the cardinality of sets of nodes. It allows us
to do this and to still produce the $O(\log n)$ (or $O(\log^i n)$ for small $i$) parallel
structures we expect from trees.

4.4  Transmission  of  Closures 

To  transmit  a  closure  from  one  processor  to  another,  it  is  not  necessary  to  transmit 
the  entire  program  and  all  of  the  environment  values,  provided  that  the  host  processor 
stands  willing  and  able  to  perform  the  work.  All  that  is  necessary  is  that  the  transmitting 
processor  send  a  token  of  some  sort.  The  receiving  processor  can  save  the  token  and  later 
use  the  closure  by  sending  back  the  arguments,  the  token,  and  control  information. 

When  this  is  done,  the  processor  sending  the  closure  (and  willing  to  do  the  work)  is 
called  the  closure’s  host,  and  the  receiving  processor  (which  has  a  license  to  use  the  closure) 
is  called  the  recipient. 

We  say  that  a  closure  is  live  if  there  is  a  possibility  that  it  will  be  invoked  at  a  given 
time.  A  closure  becomes  live  when  it  is  sent  and  remains  live  until  the  recipient  reaches  a 
point  in  its  procedure  past  which  it  can  not  invoke  the  closure.  We  will  have  more  to  say 
about  issues  concerning  the  liveness  of  closures  during  the  remainder  of  this  Section. 

Closures can be efficiently implemented in a reasonable machine model. Internally,
a closure can be implemented as a block of memory locations containing a “pointer” to
the program fragment and a list of all closed variables and the corresponding values. A
pointer to the block could be used as the token. When a closure is applied, the recipient
can send the host a copy of the token, together with whatever other information is needed
(primarily the argument(s)). The host can, using the received token, invoke the proper
fragment together with the proper environment including the arguments bound to the
parameters, by using the information contained in the closure and message.

A piece of program text (in the host processor) that creates a closure is called a closure
generating form or CGF, and a piece of text (in the receiving processor) that invokes one
is called a closure invoking form or CIF. The class of closures generated by one CGF is
a family. An instance of the family of closures generated by a specific CGF named C
will be called a C instance or an instance from C. Members of a family differ only in the
environments, since the code will be the same.
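
The block-and-token implementation described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the class names `Host`, `Recipient` and `ClosureBlock`, and the use of a table index in place of a memory pointer, are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ClosureBlock:
    """Memory block: 'pointer' to the program fragment plus the closed variables."""
    fragment: Callable[..., Any]
    env: Dict[str, Any]


class Host:
    """Hypothetical processor that creates closures and stays willing to run them."""

    def __init__(self) -> None:
        self._blocks: Dict[int, ClosureBlock] = {}

    def make_closure(self, fragment: Callable[..., Any], **env: Any) -> int:
        token = len(self._blocks)        # stands in for a pointer to the block
        self._blocks[token] = ClosureBlock(fragment, dict(env))
        return token                     # only this token is transmitted

    def apply(self, token: int, *args: Any) -> Any:
        block = self._blocks[token]      # look up fragment and environment
        return block.fragment(*args, **block.env)


class Recipient:
    """Hypothetical processor that holds a token and sends it back with arguments."""

    def __init__(self, host: Host, token: int) -> None:
        self.host = host
        self.token = token

    def invoke(self, *args: Any) -> Any:
        return self.host.apply(self.token, *args)


host = Host()
token = host.make_closure(lambda z, v_left: v_left + z, v_left=10)
recipient = Recipient(host, token)
result = recipient.invoke(5)   # the host evaluates v_left + z in its own environment
```

The recipient never sees the fragment or the value of `v_left`; it holds only the integer token, which is the point of the scheme.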

The required data transmission can be reduced in cases where it is possible to infer
various things about the use of a closure. For example, if it is known that only one
instance from a given CGF is live at a time, the host need not send the token, but only
the name of the CGF. That name would not vary and can be “assembled into” the CIF.
This can be true even if there can be several instances of a given CGF, provided
that the host knows in what order the recipient will use the closures it receives. If there is
only one CGF in a processor, and only one instance of the closures that it generates can
be live at one time, the token can vanish; the fact that the “receiving” processor wants
to apply any closure is information enough! The closure has been completely swallowed
up; information only travels from the recipient to the host, even though the synthesis was
performed as if data flowed only in the other direction.

A further simplification, of interest for the problem of synthesizing parallel structures
that will later be reduced to VLSI, is available. Suppose the following conditions are met:
Applying a closure does not include changing state in the host processor. (In this case,
for the application to be useful it must cause other applications in the host.) Assume also
that there is only one live closure in a given family at any time. Assume further that the
values used in that closure to call other closures hosted elsewhere can be computed, using
only values available to the host, by means of combinatorial logic (the code fragment is
loop-free and consists only of operators chosen from a library of integrateable operators).

In this case it is possible to perform the closure using only “combinatorial logic” in the
host processor. Specifically, no register need be provided to hold the closure’s parameter
in the host processor. Instead, logic must be provided to map a signal representing an
application of the closure to signal(s) representing application(s) of the subsequently called
closure(s). Registers are provided to hold all of the values of the closure. An example taken
from the Parallel Prefix structure (whose derivation sketch is in the next Chapter) will
make this clear.

We have the code fragment to synthesize a closure, namely $\lambda z[C_l(z) \parallel C_r(v_l + z)]$.¹

Here  we  will  observe  that  there  is  only  one  outstanding  instance  of  the  closure  at  any 
time,  because  the  variable  in  which  the  closure  is  stored  in  the  recipient  is  not  indexed. 
The  closure  does  nothing  more  than  apply  other  closures  to  a  function  of  its  argument. 
Furthermore,  the  computation  performed  on  the  argument  is  “easy”  (and  can  be  directly 
implemented  in  VLSI).  We  can  therefore  use  the  circuit  of  Figure  4.1: 

[Circuit diagram. Labels: partial sums; augend; left half sums accumulate in internal
nodes' registers.]

Figure 4.1: Simplified Parallel Prefix Internal Node

¹For clarity, the exposition assumes that the prefix operation is addition, and that an addition module
exists in our VLSI module library.


4.5  Arguments  for  the  Completeness  of  Closures 


In this Section we formalize the notions we use to argue that restricting communication
to the upward direction in trees is a harmless restriction, not preventing the synthesis of
tree parallel structures to meet any specification that could have been met absent this
restriction, provided only that we also allow upward communication of closures and that we
not consider the application of a closure that was communicated upward to be a downward
communication.

We  need  a  formal  definition  of  a  tree  parallel  structure,  and  in  order  to  do  that  we  need 
to  define  some  preliminary  concepts. 

Definition 4.2 We define a binary tree in the usual manner. A tree is either a single node
called a leaf or a node called a root connected to the roots of two subtrees by edges
which for these purposes we call wires. The root of a subtree is called an interior node.
The connection from a root to one of its subtrees is called the left connection, and that
to the other subtree is called the right connection. A root is called the parent of the two
subtree roots it is connected to, and the connection from a node to its parent is called the
parent connection.

Note that each wire is known as the parent connection at one end and either the left or
the right connection at the other. This defines an ordinary binary tree. We further state
that there is an ordering of the leaves.

Definition 4.3 The set of ancestors of a node is recursively defined as the union of the
node’s parent (if any) and the parent’s ancestors. The set of descendants of a node is
recursively defined as the union of the node’s left child, right child and their descendants.
The set of left descendants (resp. right descendants) of a node is the left (resp. right)
child together with its descendants. A tree’s leaves are ordered if the leaves are indexed by
a totally ordered set and if the index of leaf A is less than that of leaf B iff there exists a
node of which A is a left descendant and B is a right descendant.
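
As an illustration of the ordering condition, the following sketch checks whether the leaves of a small tree are ordered. The `Node` class and function names are hypothetical. The check relies on the observation that two distinct leaves are separated at their lowest common ancestor, so it suffices that, at every interior node, every left-descendant leaf index is smaller than every right-descendant leaf index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    index: Optional[int] = None      # leaf index; None on interior nodes


def leaf_indices(node: Node) -> List[int]:
    """Indices of all leaves in the subtree rooted at node, left to right."""
    if node.left is None:            # a leaf has no subtrees
        return [node.index]
    return leaf_indices(node.left) + leaf_indices(node.right)


def leaves_ordered(node: Node) -> bool:
    """True iff left-descendant leaves precede right-descendant leaves everywhere."""
    if node.left is None:
        return True
    left, right = leaf_indices(node.left), leaf_indices(node.right)
    return (max(left) < min(right)
            and leaves_ordered(node.left)
            and leaves_ordered(node.right))


# Leaves 1, 2, 3 from left to right: ordered.
good = Node(Node(Node(index=1), Node(index=2)), Node(index=3))
# Leaf 3 is a left descendant of the root while leaf 2 is a right descendant: not ordered.
bad = Node(Node(Node(index=1), Node(index=3)), Node(index=2))
```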

So far we have only described standard binary tree structures and names of
interconnections. We would also like to describe a computation structure, which is a structure
together with a set of computations and communications attached to each node. Each
node has an associated program.

For  our  purposes,  we  will  require  trees  to  be  homogeneous,  meaning  that  one  program 
will  be  run  in  all  of  the  internal  nodes,  and  another  single  program,  with  the  leaf  index  as 
a  parameter,  will  be  run  in  all  leaf  nodes.  Programs  may  do  reasonable  computations  and 
may  send  and  receive  on  the  attached  wires.  We  also  require,  however,  that  the  structure 
be  singly  buffered,  meaning  that  when  a  program  tries  to  receive  information  over  a  given 
wire,  it  will  do  nothing  else  until  the  program  of  the  node  at  the  other  end  of  the  wire  tries 
to  send,  and  when  a  program  tries  to  send  on  a  wire  twice  without  the  other  program  having 
tried  to  receive,  the  sending  program  will  do  nothing  else  until  the  other  program  tries  to 
receive. Programs may perform closure application with no regard to these restrictions,
but the transmission of the closures must have obeyed these conditions. Programs may
test whether a line has, or can accept, data and therefore avoid waiting if it cannot. The
situation where neither program at either end of the wire can send or receive is possible,
but only for a bounded amount of time (assuming correct, terminating programming).


Now  we  can  define  the  primary  parallel  structure  of  this  portion  of  TRANSCONS. 


Definition 4.4 A tree parallel structure (tree structure) is a collection of processors
together with programs that is a tree with numbered leaves, is homogeneous, and is singly
buffered.

We need a definition of a tree parallel structure with upward communication only:

Definition 4.5 An upward tree parallel structure is a tree parallel structure in
which no communication is specified from any parent to any child. Closure application
is not regarded as communication in this context.

This  is  a  formal  definition  of  the  objects  described  by  TREES  declarations,  and  in  the 
rest  of  this  Section  we  will  explore  some  of  the  implications  of  this  definition.  In  particular, 
we  are  interested  in  an  assertion  that  limiting  communication  to  an  upwards  direction,  but 
allowing  closures,  gives  the  same  expressive  power  as  allowing  communication  in  both 
directions  but  not  using  closures. 

First  we  need  a  lemma. 

Lemma 4.4 Suppose we have two processors A and B with two wires ab_1 and ab_2 from
A to B. These wires obey the “singly buffered” condition above. It is possible to simulate
those two wires with a single wire with no more than a constant factor speed loss.

Proof: The wires are driven by commands of the form read(ab_i, x), which accepts
information from the wire ab_i and stores it in x while making ab_i no longer have information
to offer; send(ab_i, x), where i = 1 or 2, which waits for the wire to be receptive and then
puts the contents of x on the wire, which makes it unreceptive but gives it data that can
be read; readable(ab_i), which returns true iff data are present (and can be read); and
sendable(ab_i), which returns true if data can be sent on the wire. It is possible to replace
these forms according to the following table:


read(ab_i, x) becomes (in B)
    while undefined(vab_i) do
        check()
    x ← vab_i
    vab_i ← undefined

readable(ab_i) becomes
    defined(vab_i)

send(ab_i, x) becomes (in A)
    while defined(wab_i) do
        check()
    wab_i ← x

sendable(ab_i) becomes
    undefined(wab_i)

Insert “check()” sufficiently often in the procedure running in both A and B to guarantee
execution periodically, with a period short compared to the time it takes to communicate
between processors. The check() call in A checks whether wab_1 and wab_2 are
defined. If either is defined, say wab_1, check() sends the pair (1, wab_1) over the wire and
does wab_1 ← undefined. The check() call in B is a finite state machine. In its initial state,
it checks whether there is anything to read on the wire; if there is, it reads it. This should
be a number i; the FSM enters a state S_i. If check() is in S_i, then it will check whether
vab_i is empty; if and only if so, it will read the next object from the wire into vab_i and
enter the initial state.

Enumeration of the sequences of actions on the two wires, actual and simulated, serves
to establish correctness. That there is only a constant-factor slowdown can be derived
from the fact that check() does a constant amount of work unless it waits, that it only
waits if (and as long as) the simulated machine would have waited, and that it replaces each
communication with a constant number (two) of communications. ∎
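
The construction in this proof can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: the class names are hypothetical, the blocking read and send forms are replaced by non-blocking `try_read` and `try_send` so that one loop can drive both processors, and the sender always favors wire 1, a choice the proof does not prescribe.

```python
from collections import deque

UNDEFINED = object()   # marker for an empty buffer


class Wire:
    """The single physical wire, singly buffered: it holds at most one item."""
    def __init__(self):
        self.slot = UNDEFINED

    def sendable(self):
        return self.slot is UNDEFINED

    def readable(self):
        return self.slot is not UNDEFINED

    def send(self, item):
        self.slot = item

    def read(self):
        item, self.slot = self.slot, UNDEFINED
        return item


class SenderA:
    def __init__(self, wire):
        self.wire = wire
        self.w = {1: UNDEFINED, 2: UNDEFINED}   # buffers wab_1, wab_2
        self.pending = UNDEFINED                # payload whose index was already sent

    def try_send(self, i, x):                   # non-blocking stand-in for send(ab_i, x)
        if self.w[i] is not UNDEFINED:
            return False
        self.w[i] = x
        return True

    def check(self):
        if not self.wire.sendable():
            return
        if self.pending is not UNDEFINED:       # second half of the pair (i, value)
            self.wire.send(self.pending)
            self.pending = UNDEFINED
            return
        for i in (1, 2):
            if self.w[i] is not UNDEFINED:      # first half: the wire number
                self.wire.send(i)
                self.pending, self.w[i] = self.w[i], UNDEFINED
                return


class ReceiverB:
    def __init__(self, wire):
        self.wire = wire
        self.v = {1: UNDEFINED, 2: UNDEFINED}   # buffers vab_1, vab_2
        self.state = 0                          # 0 is the initial state, i means S_i

    def try_read(self, i):                      # non-blocking stand-in for read(ab_i, x)
        x, self.v[i] = self.v[i], UNDEFINED
        return x

    def check(self):
        if self.state == 0:
            if self.wire.readable():
                self.state = self.wire.read()   # read the number i, enter S_i
        elif self.v[self.state] is UNDEFINED and self.wire.readable():
            self.v[self.state] = self.wire.read()
            self.state = 0


def demo():
    wire = Wire()
    a, b = SenderA(wire), ReceiverB(wire)
    to_send = {1: deque(["a", "b"]), 2: deque(["x"])}
    got = {1: [], 2: []}
    for _ in range(50):
        for i in (1, 2):
            if to_send[i] and a.try_send(i, to_send[i][0]):
                to_send[i].popleft()
        a.check()
        b.check()
        for i in (1, 2):
            x = b.try_read(i)
            if x is not UNDEFINED:
                got[i].append(x)
    return got
```

Each simulated transfer costs two transfers on the real wire (the number, then the value), which matches the constant factor cited in the proof.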

Now  we  can  prove  a  fundamental  theorem  about  unidirectional  communication  in  a 
tree. 

Theorem 4.5 Suppose we have a tree parallel structure T without transmission of closures.
Then it is possible to perform the same computation that T performs on an upward tree
parallel structure.

Proof: Simulate a second wire from the left (resp. right) child to its parent per the
previous lemma. Call that wire C_l (resp. C_r) where it impinges on the parent and C_p
where it impinges on the child. In what follows we describe the treatment for left children;
the treatment for right children is analogous. The parent may contain send(left, x) and
sendable(left); the child may contain read(parent, x) and readable(parent). Each of
these is replaced by new text as follows:

send(left, x) becomes (in the parent)
    read(C_l, C)
    C(x)

sendable(left) becomes
    readable(C_l)

read(parent, x) becomes (in the child)
    while undefined(v) do
        check()
    x ← v
    v ← undefined
    send(C_p, λz[v ← z])

readable(parent) becomes
    defined(v)




check() is similar to that of the previous lemma. Additionally, the fragment

    send(C_p, λz[v ← z])

must be prepended to the child’s program and

    read(C_l, C)
    C(x)

appended to the parent’s.

That this causes correct information to be seen in the recipient is evident from the
observation that each closure is used to send exactly one value to the recipient, exactly
that value is used as an argument to the closure as was previously being sent, and it is
only used once (and immediately rendered undefined). That this causes the programs to
“hang” at exactly the right times can be easily seen from the fact that there is a closure
in (say) C_l exactly when the recipient would have been receptive, and there is a value in v
exactly when there would have been a value available. ∎

The key point to note is that all downward communication is expressed as closure
application. This suggests that it will be possible to express a problem that apparently
cannot be solved by […] as the corresponding problem of creating, in the root, a closure
that has a desired result when applied.
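
The replacement of downward sends by closure application can be sketched as follows. This is a simplified illustration, not the construction of the proof itself: the `Parent` and `Child` classes are hypothetical, the upward wire is modeled as a single attribute `c_left`, and the blocking loops around check() are omitted, so callers are expected to test `sendable_left()` and `readable_parent()` first.

```python
UNDEFINED = object()   # marker for "no closure" or "no value"


class Parent:
    def __init__(self):
        self.c_left = UNDEFINED              # upward wire C_l carrying a closure

    def sendable_left(self):                 # sendable(left) becomes readable(C_l)
        return self.c_left is not UNDEFINED

    def send_left(self, x):                  # send(left, x) becomes read(C_l, C); C(x)
        closure, self.c_left = self.c_left, UNDEFINED
        closure(x)


class Child:
    def __init__(self, parent):
        self.parent = parent
        self.v = UNDEFINED
        self._offer_closure()                # fragment prepended to the child's program

    def _offer_closure(self):                # send(C_p, lambda z [v <- z])
        def closure(z):
            self.v = z
        self.parent.c_left = closure

    def readable_parent(self):               # readable(parent) becomes defined(v)
        return self.v is not UNDEFINED

    def read_parent(self):                   # read(parent, x)
        x, self.v = self.v, UNDEFINED
        self._offer_closure()                # become receptive again
        return x


parent = Parent()
child = Child(parent)
parent.send_left(7)                          # the only "downward" step is a closure call
received = child.read_parent()
```

The only object that crosses the wire is the closure, and it travels upward; the value 7 reaches the child solely through the application of that closure.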




We have therefore shown that we do not surrender any expressive power when we limit
tree declarations to upward communication.

4.6 Trees of Processors in TransConS

Trees of processors can be used to efficiently implement many specifications because the
tree is that topology with fixed arity and lowest connectivity that allows a distinguished
node to have contact with all other nodes in O(log n) steps, which is clearly the best
possible. TransConS ([KinSS]) therefore has facilities for specifying, synthesizing and
manipulating trees.

The description of a tree is specified in TREE declarations, described below. Before
describing the syntax of a TREE declaration, we will describe some of the semantics we
intend for it.

The trees we intend to address are used to shorten the longest path lengths within
the collection of processors, and to balance the workload of a computation. There are
problems amenable to a tree solution, portions of which are in some sense more important
than others (for example Optimal Binary Search Trees), but in these problems there must
be a specification of relative importance that has a size comparable to the size of a good
specification of the solution. We will therefore model solutions to problems of this sort
by building separate trees and AGGREGATEing them. Each tree described in a single
locution will be balanced.

Several  principles  govern  the  design  of  the  tree  system  of  TRANSCONS. 


•  All trees are as balanced as possible. (We use binary trees; extensions to trees of
higher arity introduce no new principles.) No flexibility in terms of shape is assumed,
nor is any way provided for expressing shapes.


•  A  tree  specification  must  include  a  size,  which  can  be  any  integer  greater  than  one. 


•  The shapes of two trees of the same size are identical. That is, there is an
isomorphism σ between two trees of the same size that maps parents, left children and
right children respectively into parents, left children and right children. There are
“compile-time” constructs in the TRANSCONS language that allow for the specification
of connections to the node that is σ-equivalent to a given node, or AGGREGATION
between corresponding nodes of different trees. One way to achieve this identity of
shape is to have a left-biased tree that is as balanced as possible. In other words,
path lengths from root to leaves differ by at most one and if one such path is longer
than a second the first path must be to the left of the second.

•  The nodes of a tree are divided into three groups. They are the root, the internal
nodes, and the leaves. The leaves are further distinguished by indices. References
to any of these classes of tree nodes, either to attach procedures, to specify
communication such as HEARS, or to AGGREGATE, can be made. Tags are provided
for a node to refer to a node of another tree that is σ-equivalent to it if the two trees
are the same size. This allows nodes in σ-equivalence classes to be AGGREGATED or to
HEAR each other. For this to work, values have to be declared properly. Note that
a leaf has to offer instances of values that are HEARd upward, and the root has to
offer values that are HEARd downward.
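
One way to obtain the left-biased, maximally balanced shape mentioned in the principles above is sketched below. The construction is an assumption made for illustration (the text does not specify how TRANSCONS lays out its trees): leaves fill the deepest level from left to right, as in a complete binary tree, and `leaf_depths` is a hypothetical helper name. Since the shape depends only on the size, two trees of the same size come out identical.

```python
def leaf_depths(n):
    """Depths of the n leaves, left to right, of a left-biased balanced binary tree."""
    if n == 1:
        return [0]
    # m is the largest power of two strictly less than n.
    m = 1 << ((n - 1).bit_length() - 1)
    # The left subtree is filled first: it is either full (m leaves) or takes
    # every leaf not needed to give the right subtree a full shallower level.
    left = min(m, n - m // 2)
    return [d + 1 for d in leaf_depths(left) + leaf_depths(n - left)]


def is_left_biased_balanced(depths):
    """Path lengths differ by at most one and never increase from left to right."""
    return (max(depths) - min(depths) <= 1
            and all(a >= b for a, b in zip(depths, depths[1:])))


shape_5 = leaf_depths(5)
shape_6 = leaf_depths(6)
```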

To  support  these  stipulations  we  have  the  TREE  data  type.  A  tree  is  declared  and  its 
components  laid  out  using  the  type  facility  of  CHI.  As  an  example,  we  will  describe  below 
a  situation  where  there  are  two  trees,  T  and  U.  Each  is  of  size  n.  Each  internal  node  of  T 
passes  a  value  to  its  children  after  having  multiplied  it  by  a  value  from  the  corresponding 
internal  node  of  U.  Each  internal  node  of  U  adds  values  from  its  two  children.  The 
procedures at the leaves of T and U, respectively, are described by functions H and G, not
interpreted here.

T istype TREE (i) SIZE n
    root      HAS v    TALKS leftson (SENDS v)
                       TALKS rightson (SENDS v)
                       HEARS source (USES outside-value)
                       HEARS U.root (USES u-value)
    inter     HAS v    TALKS leftson (SENDS v)
                       TALKS rightson (SENDS v)
                       HEARS parent (USES v)
                       HEARS U.inter (USES u-value)
    leaf (i)  HAS l_i  HEARS parent (USES v)

U istype TREE SIZE n
    root      HAS u    TALKS T.root (SENDS u)
                       HEARS leftson (USES v AS v.left)
                       HEARS rightson (USES v AS v.right)
    inter     HAS u    TALKS T.inter (SENDS u AS u-value)
              HAS v    TALKS parent (SENDS v)
                       HEARS leftson (USES v AS v.left)
                       HEARS rightson (USES v AS v.right)
    leaf (i)  HAS v    TALKS parent (SENDS v)
                       HEARS source_i (USES A_i)


(in T.root)
    v ← outside-value × u-value
(in T.inter)
    v ← v × u-value
(in T.leaf_i)
    l_i ← H(v)
(in U.root)
    v ← v.left + v.right
(in U.inter)
    v ← v.left + v.right
    u ← v
(in U.leaf_i)
    v ← G(A_i)


Note the SENDS u AS u-value locution. This causes a value to be known as u in
the intermediate node but to be known as u-value in the recipient.


This example displays three features of tree structures. The U tree has upward
communication and uses v to name links on which information flows from the leaves to the root.
There is also cross communication using u-value. Finally, there is downward information
flow in T, again using the name v.

In  the  next  Chapter  we  explore  the  syntheses  of  several  tree  structures  in  detail. 




Chapter  5 


Tree  Structures  Synthesis  Examples  and  Closure 
Removal 


In  this  Chapter,  we  will  consider  the  broadcast  problem,  the  prefix  summation  problem, 
and  a  part  of  one  solution  to  the  connected  components  problem  that  is  amenable  to  tree 
solution. 

Our motivation is to display the tree synthesis methods of TRANSCONS in some detail.
We first show the use of D&C to synthesize some tree structures that include closures. We
then exhibit closure removal techniques that convert these into structures that can be
implemented in computation models that do not permit closures.

The division step requires some explanation. Our basic technique is to assert that there
exists a closure whose action is to make true the required first order logic (FOL) predicate,
and then to make the goal to compute this closure and to apply it. The next step is to assert
that there is a solution to the problem, if there is a solution to appropriate subproblems.
The problems are expressed as input/output behavior on vectors, and all problems and
subproblems are concerned with (contiguous) subvectors of the problem instance vector.
Different methods of performing the division step result in different tree structures.


Specifically, we have the hypothesis
$\exists C_l^u, u'\,[\mathrm{action}(C_l^u()) = \forall i \in \{l \ldots u\}[P(i)]] \wedge C_l^u = G(C_l^{u'}, C_{u'+1}^u)$.
This isn’t strong enough for us to synthesize a tree structure, because we
have to know more about usable values of $u'$ given $l$ and $u$. If we have
$\exists C_l^u\,\forall u', l \le u' < u\,[\mathrm{action}(C_l^u()) = \forall i \in \{l \ldots u\}[P(i)]] \wedge C_l^u = G(C_l^{u'}, C_{u'+1}^u)$,
then we can certainly make a
balanced binary tree (or any other tree structure we choose); our choice of $u'$ is unrestricted
(except that it has to split the range into two nonempty pieces).

Other interesting possibilities include $\ldots[u' = l + 1 \wedge \mathrm{action}\ldots$ or
$\ldots[u' = u - 1 \wedge \mathrm{action}\ldots$, both of which would create trees that are as
unbalanced as possible (identical to chains), and
$\exists C_l^u, u'[u - l \le 3(u' - l) \le 2(u - l) \wedge \mathrm{action}\ldots$, which gives us a tree
structure with logarithmic depth, if the $u'$ can be found at compile time. These cases
provide interesting future research but are beyond the scope of this thesis.

5.1  Broadcast 

In the broadcast problem, value(s) known in a central location are distributed to many
locations. The broadcast problem can be described formally as $\forall i[a'_i \leftarrow F(a_i, x)]$ or perhaps
$\forall j[\forall i[a'_{ij} \leftarrow F(a_{ij}, x_j)]]$. One method of synthesizing solutions to this problem might be to
recognize it as a distinct pattern and carry a synthesis rule that produces a broadcast tree
when supplied an instance of a broadcast problem. Another solution is to produce a chain
of processors as a bucket brigade to distribute the information, and then to successively
split the chain in half. The problem with this solution is that the synthesis process is
iterated a variable number of times. With the new mechanism of closure passing, it is
possible to provide more general rules that handle broadcast problems as a special case
without multiple reformulations.

Consider the application of D&C. We want to produce a closure that, when applied
to $x$, performs $\forall i[a'_i \leftarrow F(a_i, x)]$. We hypothesize that to solve the problem for a whole
subarray, we can solve the problem for each of two pieces of the subarray and combine the
two solutions in some manner. Giving the names $f_l$ and $f_r$ to the closures for the left and
right halves of the problem, and $f_w$ to one for solving the whole problem, we then show that
to combine closures $f_l = \lambda x[\forall j \in r_l[a'_j \leftarrow F(a_j, x)]]$ and $f_r = \lambda x[\forall j \in r_r[a'_j \leftarrow F(a_j, x)]]$
we have only to create $f_w = \lambda y[f_l(y) \parallel f_r(y)]$. We go through the following sequence:


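$\forall A, x\,\exists A'[\forall i \in [1 \ldots n][a'_i = F(a_i, x)]]$

$\Rightarrow \exists C(x)[\mathrm{action}(C(x)) = \forall A, x[\forall i \in [1 \ldots n][a'_i = F(a_i, x)]]]$   (abstraction)

hypothesis: $\exists C_l^u, \forall u', l \le u' < u\,[\mathrm{action}(C_l^u(x)) = \forall A, x[\forall i \in [l \ldots u][a'_i = F(a_i, x)]]$
$\wedge\; C_l^u(x) = G(C_l^{u'}(G_l(x)), C_{u'+1}^u(G_r(x)))]$   (division)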


The  abstraction  step  is  the  step  of  asserting  that  there  is  a  function  whose  application 
makes  true  the  FOL  predicate  that  is  being  abstracted.  The  division  step  is  the  step  of 
asserting  that  it  is  possible  to  build  a  closure  that  solves  a  large  problem,  given  closures 
that  solve  subproblems  (and  possibly  other  data). 

In this case it is possible to assert that any $u'$, $l \le u' < u$, meets the requirements, and
that in this case a balanced binary tree solution to the problem will certainly work because
the division $u' = \lfloor (l + u)/2 \rfloor$ will provide one.

Setting $G(C_1(x), C_2(y)) = C_1(x) \parallel C_2(y)$ ($\parallel$ is concurrent evaluation) and
$G_l(x) = G_r(x) = x$ will cause the division hypothesis to be true.

It only remains to describe the procedure for handling a singleton array. This is the
closure $\lambda x[a'_i \leftarrow F(a_i, x)]$.

The computation of the top level closure is O(log n) where n is the size of the problem.
This is clear from the theorems of the previous Chapter and from the observation that
T(G) = O(1). (G is creation of a closure enclosing two given closures.) Similarly, the
time consumed by an application of the top closure will be O(log n) from the fact that
max(T(G_l), T(G_r)) = O(1). (G_l and G_r are identity operations.)
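
The broadcast derivation can be illustrated with a short sequential sketch. The function name `broadcast_closure` and the output list `out` are assumptions made for illustration, and concurrent evaluation is replaced by two sequential calls, so the sketch shows the structure of the closures rather than their parallel running time.

```python
def broadcast_closure(a, out, l, u, F):
    """Build a closure which, applied to x, sets out[i] = F(a[i], x) for l <= i <= u."""
    if l == u:
        def leaf(x):                      # singleton closure
            out[l] = F(a[l], x)
        return leaf
    mid = (l + u) // 2                    # division point u', with l <= u' < u
    left = broadcast_closure(a, out, l, mid, F)
    right = broadcast_closure(a, out, mid + 1, u, F)

    def whole(x):                         # G: enclose the two given closures
        left(x)                           # G_l is the identity
        right(x)                          # G_r is the identity
    return whole


a = [1, 2, 3, 4, 5]
out = [None] * len(a)
top = broadcast_closure(a, out, 0, len(a) - 1, lambda a_i, x: a_i * x)
top(10)                                   # distribute the value 10 to every element
```

Building `whole` takes constant work per node once the two component closures exist, which corresponds to the observation that T(G) = O(1).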


5.2 Parallel Prefix Summation


Prefix summation is a mapping of arrays onto arrays of the same size such that the
i-th element of the output array is the sum of the first i elements of the input array. This
generalizes to other reduction operations; it makes sense to talk about prefix product,
prefix and, etc. The only restriction, shared with other reduction operations, is that the
operation be associative.

Handmade  parallel  structures  that  solve  the  prefix  summation  problem  have  appeared 
in  the  literature.  See,  for  example,  [LFiSO]  and  the  more  recent  [Fic83].  Below  we  describe 
methods  to  synthesize  such  structures. 

5.2.1  Overview 

To use the closure technique on a given specification, reformulate the problem from
something like $\forall X, \ldots \exists Y[P(X, Y)]$ to $\exists C[\forall X, \ldots \mathrm{action}(C()) = P(X, Y)]$. Heuristically,
the problem is reformulated from that of satisfying a specific input/output specification
to that of producing a closure wherein the I/O specification¹ will be satisfied when it is
applied.

¹More precisely, the problem of satisfying the I/O specification that requires no input and produces that
closure.

We will need to define “augmented prefix summation with augend $z$” as
$\forall 1 \le i \le n[a'_i = z + \sum_{j=1}^{i} a_j]$. We then say that the task is to deliver, to the root of the tree,
a closure that will perform augmented prefix summation. To create a closure that will
perform augmented prefix summation with augend $z$ on a non-trivial vector, divide it into
two halves, get such a closure from each half together with the grand total of the input
values for that half, and invoke the left half’s closure with $z$ as an augend and the right
half’s with the left half’s sum added to $z$. We deliver, to each node of the tree, closures
that will perform augmented prefix summation on the vector comprising its leaves, together
with the leaves’ sum. Note that the closure delivered to each node’s parent has to include
the left subtree’s sum, which is available now but won’t be later. A more formal description
follows.

Assume that a vector is divided into […]. Further assume that
we are trying to compute […], which we will denote $F_l^u$. Further assume that we want
to have some effects, local to the array elements. We would therefore want to compute a
closure, $C_l^u$, that would have the desired effect.

The generic combination operator for the values is $F_l^u =$ […], and it
is a synthesis task to derive the properties of $G$. Similarly, […]

If the closure has an argument, the situation is slightly more complex; we have
$C_l^u(z) = G(C_l^{u'}(G_l(z, F_l^{u'}, l, u, u')), \ldots)$ […], where the $F$ vectors
are the values available to (and incorporated in) $C_l^u$. This general schema need only
be used with specific combiners (i.e., $G$, $G_l$, etc.). As a simple example, prefix summation
can be performed by this schema if $G \equiv (C_{left} \parallel C_{right})$ (where $\parallel$ is concurrent
application), $G_l(z) = z$, and $G_r(z) = z + v_l$. $v$, in turn, is computed as $v_l + v_r$. Singleton v- and
C-expressions are $C_i \equiv \lambda z[a'_i \leftarrow a_i + z]$ and $v_i = a_i$.
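
The prefix summation instance of the schema can be illustrated with a small sequential sketch. The function name `build` and the output list `out` are assumptions made for illustration, and concurrent application is replaced by two sequential calls. Each node returns the pair (closure, sum) for its subvector, and the internal-node closure keeps only the left sum in its environment.

```python
def build(a, out, l, u):
    """Return (C, v): C(z) sets out[i] = z + a[l] + ... + a[i] for l <= i <= u; v = sum(a[l..u])."""
    if l == u:
        def leaf(z):                      # singleton closure: a'_i <- a_i + z
            out[l] = a[l] + z
        return leaf, a[l]                 # singleton value: v_i = a_i
    mid = (l + u) // 2
    c_left, v_left = build(a, out, l, mid)
    c_right, v_right = build(a, out, mid + 1, u)

    def node(z):
        c_left(z)                         # G_l(z) = z
        c_right(z + v_left)               # G_r(z) = z + v_l
    return node, v_left + v_right         # v = v_l + v_r


a = [3, 1, 4, 1, 5]
out = [None] * len(a)
top, total = build(a, out, 0, len(a) - 1)
top(0)                                    # augend 0 gives the plain prefix sums
```

Applying `top` with a nonzero augend shifts every prefix sum by that amount, which is the augmented prefix summation defined above.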

5.2.2  Derivation 

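In prefix summation, the specification to meet is $\forall i \in \{1 \ldots n\}\,[a'_i \leftarrow \sum_{j=1}^{i} a_j]$. We
will introduce the abbreviation $\Sigma_l^u = \sum_{j=l}^{u} a_j$. This then becomes
$\forall i \in \{1 \ldots n\}\,[a'_i \leftarrow \Sigma_1^i]$.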

We  change  the  specification  to  one  requiring  the  computation  of  a  closure  which, 
when  applied  to  no  arguments,  performs  this  action,  together  with  the  application  of  that 
closure. 

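$$\forall A\, \exists A'\ \big[\forall i \in \{1 \ldots n\}\,[a'_i = \Sigma_1^i]\big]$$

$$\Leftarrow\ \exists C\, \forall A\ \ \mathrm{action}(C()) = \forall i \in \{1 \ldots n\}\,[a'_i = \Sigma_1^i] \qquad \text{(abstraction)}$$

hypothesis:

$$\big[\mathrm{action}(C_l^u()) = \forall i \in \{l \ldots u\}\,[a'_i = \Sigma_l^i]\ \wedge\ C_l^u = G(C_l^{u'}(), C_{u'+1}^u())\big] \qquad \text{(division)}$$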


But $\mathrm{action}(C_{u'+1}^u()) = \forall i \in \{u'+1 \ldots u\}\,[a'_i = \Sigma_{u'+1}^i]$, and it is impossible for $G$ to do
anything to $C_{u'+1}^u$ to make its action $\ldots a'_i = \Sigma_l^i$. $C$ must have a parameter in
order to allow $G$ to provide enough information.

We modify the closures so instead of $\mathrm{action}(C_l^u()) = \ldots$ we have $\mathrm{action}(C_l^u(z)) =
\forall i \in \{l \ldots u\}\,[a'_i = H(\Sigma_l^i, z, l, i)]$. We do not yet know the properties of $H$.

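We now have:

$$\begin{aligned}
\mathrm{action}(C_l^u(z)) &= \forall i \in \{l \ldots u\}\,[a'_i = H(\Sigma_l^i, z, l, i)] \\
\mathrm{action}(C_l^{u'}(z)) &= \forall i \in \{l \ldots u'\}\,[a'_i = H(\Sigma_l^i, z, l, i)] \\
\mathrm{action}(C_{u'+1}^u(z)) &= \forall i \in \{u'+1 \ldots u\}\,[a'_i = H(\Sigma_{u'+1}^i, z, l, i)]
\end{aligned}$$

So we observe

$$\begin{aligned}
\mathrm{action}(C_l^u(z))
&= \forall i \in \{l \ldots u\}\,[a'_i = H(\Sigma_l^i, z, l, i)] && \text{(above)} \\
&= \mathrm{action}(G(C_l^{u'}(G_l(z,l,i)),\, C_{u'+1}^u(G_r(z,l,i)))) && \text{(D\&C)} \\
&= G\big(\forall i \in \{l \ldots u'\}\,[a'_i = H(\Sigma_l^i, G_l(z,l,i), l, i)], \\
&\qquad\ \ \forall i \in \{u'+1 \ldots u\}\,[a'_i = H(\Sigma_{u'+1}^i, G_r(z,l,i), l, i)]\big)(z) && \text{(expansion)} \\
&= \forall i \in \{l \ldots u'\}\,[a'_i = H(\Sigma_l^i, z, l, i)] \wedge \forall i \in \{u'+1 \ldots u\}\,[a'_i = H(\Sigma_l^i, z, l, i)] && \text{($\forall$ identity)}
\end{aligned}$$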

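Assuming $G$ merely generates a closure to produce application of both of its parameters,
then $H(\Sigma_l^i, z, l, i) = H(\Sigma_l^i, G_l(z,l,i), l, i)$ and $H(\Sigma_l^i, z, l, i) = H(\Sigma_{u'+1}^i, G_r(z,l,i), l, i)$.

The first solves as $z = G_l(z,l,i)$. The second needs a bit more attention. If we
represent $\Sigma_l^i$ as $\Sigma_l^{u'} + \Sigma_{u'+1}^i$, we find that
$H(\Sigma_l^i, z, l, i) = H(\Sigma_l^{u'} + \Sigma_{u'+1}^i, z, l, i) = H(\Sigma_{u'+1}^i, G_r(z,l,i), l, i)$
$\Leftrightarrow H(r + q, z, l, i) = H(q, G_r(z,l,i), l, i)$ where $r = \Sigma_l^{u'}$ and $q = \Sigma_{u'+1}^i$.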

Letting $H = \lambda x, y, z, w\,[y + x]$ we get $G_r = \lambda z\,[r + z]$. This can lead to two problems.

One  is  that  H{x,y,z,z)  x  must  be  satisfied,  so  it  must  be  possible  to  satisfy  y + z  =  z  for 
some  y.  At  this  point,  if  there  is  no  identity  to  the  operation,  we  have  to  say  Gi{x,z,z)  = 
(a  special  value),  and  H  =  Az,y,z,w[y  0  z]  where  (special  value)  0  z  =  z.  The  other 
problem is that there isn't enough information around to compute $G_r$. We have to expand
the problem again to bring about the availability of intermediate values for the intermediate
closures.  In  this  case  we  need  Instead  of 

action  (CrW)  =  *ction  (G(cr'{^;i(z./,0).<^^+i(Gr(z./,0))) 

we  want 

and 

BCtion{Cri^))  H  action  (G(cr'(G^(«r^C+l,*)).C'“,+l(Gr(.»r^v^.+,,z)))  . 

Taking a more intuitive view for the moment, we observe that we want to compute
a pair $(v_l^u, C_l^u)$ in which $v_l^u = \Sigma_l^u$ and in which $\mathrm{action}(C_l^u(z))$ is the computation of an
augmented prefix summation, where $a'_i \leftarrow z + \Sigma_l^i$ instead of $a'_i \leftarrow \Sigma_l^i$.

We want $G_r(v_l^{u'}, z) = z + \Sigma_l^{u'}$, so we must use $v_l^{u'} = \Sigma_l^{u'}$, or $v_l^u = \Sigma_l^u$.

We  lack  only  one  step  to  a  complete  solution.  Initially  we  wanted  to  compute  a  closure 
which, when computed for that "sub-array" which is the whole array and applied to no
argument,  computes  the  prefix  sum.  We  will  get,  instead,  a  pair  of  results.  One  of  the 
results  is  a  value,  and  the  other  is  a  closure  which,  when  applied  to  one  value,  computes  a 
generalization  of  the  prefix  sum.  It  remains  to  convert  this  back  into  a  closure  that  can  be 
applied  to  no  arguments. 

We have

$$\mathrm{action}(C_l^u(z)) = \forall i \in \{l \ldots u\}\,[a'_i = \Sigma_l^i + z]$$

and we want

$$\mathrm{action}(F'()) = \forall i \in \{1 \ldots n\}\,[a'_i = \Sigma_1^i] = \mathrm{action}(C_1^n(z)) = \forall i \in \{1 \ldots n\}\,[a'_i = z + \Sigma_1^i]$$

for some $z$. Clearly $z = 0$ works. The operation will always have an identity because the
creation of $H$ will require a new operation to be created if the basic operation lacks one.
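
The remark about operations that lack an identity can be illustrated with a small Python sketch. It assumes the "special value" is represented by a unique sentinel object and uses a sequential scan rather than the tree structure; `IDENTITY`, `lift`, and `prefix_scan` are hypothetical names introduced only for this illustration.

```python
IDENTITY = object()  # stands in for the "special value"; an assumption of this sketch


def lift(op):
    """Extend a binary operation so that IDENTITY acts as its identity element."""
    def lifted(x, y):
        if x is IDENTITY:
            return y
        if y is IDENTITY:
            return x
        return op(x, y)
    return lifted


def prefix_scan(op, values):
    """Sequential prefix computation whose initial augend is the adjoined identity."""
    combine = lift(op)
    running = IDENTITY
    out = []
    for x in values:
        running = combine(running, x)
        out.append(running)
    return out
```

With addition the sentinel behaves like 0; with `max` over unbounded integers, which has no identity of its own, the adjoined value fills that role.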

5.2.3  Derivation  Summary 

We have taken all of the following steps:

• We transformed an I/O specification whose input and output were vectors into an
I/O specification taking vectors into a closure;

• We substituted the I/O specification into a general D&C scheme;

• We established that the subclosures need an argument to fill their role in the closure
that is being computed and modified the specification to reflect this;

• We augmented the specification to compute another value that is needed to compute
the argument that subclosures will need;

• We performed backwards inference to determine the I/O behavior of generic functions
from the D&C formulation; and

• We returned to the original problem of computing a closure to solve the original problem
in terms of the new specification that specifies a closure that is a generalization
of the function desired.

In slightly more detail, the steps are as shown below:
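
$$\begin{aligned}
\mathrm{action}(C()) &\equiv \forall\, 1 \le i \le n\,[a'_i = \Sigma_1^i] \\
\mathrm{action}(C_l^u()) &\equiv \forall\, l \le i \le u\,[a'_i = \Sigma_l^i] \\
&\equiv \forall\, l \le i \le u'\,[a'_i = \Sigma_l^i]\ \wedge\ \forall\, u'+1 \le i \le u\,[a'_i = \Sigma_l^i] \\
&\equiv \forall\, l \le i \le u'\,[a'_i = \Sigma_l^i]\ \wedge\ \forall\, u'+1 \le i \le u\,[a'_i = \Sigma_l^{u'} + \Sigma_{u'+1}^i]
\end{aligned}$$
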


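We must supply a new parameter:

$$\begin{aligned}
\mathrm{action}(C(z)) &\equiv \forall\, 1 \le i \le n\,[a'_i = H(\Sigma_1^i, z, 1, i)] \\
\mathrm{action}(C_l^u(z)) &\equiv \mathrm{action}(G(C_l^{u'}(G_l(z)),\, C_{u'+1}^u(G_r(z))))
\end{aligned}$$

This requires $H(\Sigma_l^{u'} + \Sigma_{u'+1}^i, z, l, i) = H(\Sigma_{u'+1}^i, G_r(z), l, i)$, which works if
$H(x, y, z, w) = x + y$ and $G_r(z) = \Sigma_l^{u'} + z$, but the latter requires having $\Sigma_l^{u'} + z$ available.
We therefore further modify the problem by requiring the collection of another value.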


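The last observations we need (the base case) are:

$$\begin{aligned}
v_i^i &= \Sigma_i^i = a_i \\
C_i^i &= \lambda z\,\big[\forall\, i \le j \le i\,[a'_j = z + \Sigma_i^j]\big] = \lambda z\,[a'_i = z + a_i]
\end{aligned}$$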


We therefore have $J(x,y) = x + y$, making $v_l^u = v_l^{u'} + v_{u'+1}^u$. $C_l^u(z)$ applies $C_l^{u'}$ to $z$,
and $C_{u'+1}^u$ to $z + v_l^{u'}$. Creating new symbols for the values ($v_l$, $v_r$ and $v$) and closures ($C_l$,
$C_r$, and $C$) received from the subproblems and passed to the superproblem, we finally get
the following:

    J(x, y)      = x + y
    v            = v_l + v_r
    v_leaf_i     = a_i
    G(C_l, C_r)  = C_l(G_l(z)) ∥ C_r(G_r(z))
    G_l(z)       = z
    G_r(z)       = z + v_l
    C_leaf_i     = λz[a'_i = z + a_i]
    C_root       = λ()[G(C_l, C_r)(0)]


We have G(C_l, C_r) = C_l(G_l(z)) ∥ C_r(G_r(z)). We would therefore have a synthesized
TREE declaration to read, in part,

    inter HAS C, v
        HEARS leftson (USES C as Cl, USES v as vl)
        HEARS rightson (USES C as Cr, USES v as vr)
        TALKS parent (SENDS C, SENDS v)

and the program for the internal nodes to read, in part,

    (in T.inter) :
        C ← λz[Cl(G_l(z)) ∥ Cr(G_r(z))]
            where G_l(z) = z
            where G_r(z) = z + vl
        v ← vl + vr

This can be converted to a tree structure free of closures by simple rewrite rules to be
described in the next Section.


5.3  Closure Reduction


The parallel structure that results from the parallel prefix synthesis contains closure
generating forms and closure applications. Permitting this makes the D&C synthesis process
work, but the resulting structure is not a desirable one because the multiprogramming
model must be used. Closure generating forms (CGF's) are not part of the lower levels
because they require the processors to be capable of efficient context switches. Even if the
multiprogramming model is satisfactory, there is one other reason for developing technology
to eliminate closures from parallel structures. Any direct implementation of closures
requires two messages to be sent, one in each direction: the closure from the host to
the recipient, and the application message from the recipient to the host. This is wasteful
because the closure carries no information beyond collation (matching recipient-to-host
messages with corresponding host-resident data) and indication of readiness.

Both collation and readiness information are redundant in some parallel structures such
as the one we have just synthesized for parallel prefix. Even where collation is necessary
we will see that it can be realized by sending data only from the recipient to the host.
For all of these reasons we are motivated to provide transformation rules that remove
closures from a specification, replacing them with equivalent transmissions of argument
data and possibly collation data. We call the process of transforming a parallel structure
that includes closures into an equivalent one that does not, closure reduction.

There are two cases to consider: either there can be more than one live CGF instance
at once, or there cannot. We will call the first type of CGF a multiple CGF and the second
a simple CGF. The recipient needs to send the host collation information only to
apply an instance of a multiple CGF, not of a simple one. We will see below that the same
reduction methods can be used for the two cases, with the redundant transmission being
removed by further processing of the parallel structure that results from closure reduction.


5.3.1  CGF  Reduction  Rules 


We  describe  an  example  of  a  case  in  which  a  CGF  would  support  multiple  instances.  A 
two  dimensional  array  of  numerical  values  will  be  available  from  an  external  source.  This 
array  will  be  described  as  having  m  rows  and  n  columns  in  what  follows.  The  information 
will  be  available  a  row  at  a  time.  Each  row  has  some  maximal  elements  and  some  non- 
maximal  ones.  The  problem  is  to  determine,  for  each  column,  the  sum  of  all  elements  that 
are  maximal  in  their  row. 


There is a simple, fast, and memory-efficient tree-structured solution to this problem.
A processor is assigned per column, and the column processors are connected by a binary
tree. As the rows of the array arrive, the structure initiates a maximization calculation to
determine the maximal element(s) of the row and send back, to those columns' processors
that had maximal elements, a copy of that element. Each processor sums the replies it
receives. It is possible to pipeline, i.e., to initiate subsequent maximizations before the
results return from the first.
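
As a point of reference, the quantity the tree computes can be stated as a short sequential Python function. This is only a baseline for checking results, not the tree-structured solution itself; the function name is hypothetical and rows are assumed to be non-empty lists of numbers of equal length.

```python
def column_sums_of_row_maxima(rows):
    """For each column, sum the elements that are maximal in their own row."""
    sums = None
    for row in rows:                      # rows arrive one at a time
        if sums is None:
            sums = [0] * len(row)
        top = max(row)                    # the maximization step for this row
        for j, x in enumerate(row):
            if x == top:                  # column j held a maximal element
                sums[j] += x
    return sums if sums is not None else []
```

Ties are counted in every column that holds the maximum, matching the phrase "maximal element(s)".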


We reformulate the problem to that of computing, for all $i$ in range,
$C_i = \lambda z\,[\forall j \in (1 \ldots n)\,(a'_j \leftarrow \text{if } b_{ij} = z \text{ then } a_j + z \text{ else } a_j)]$ and $v_i = \max_j b_{ij}$, together with the
application $C_i(v_i)$ for all $i$ in range. We will use a division step in which substrings' closures
are both applied either to the value of the maximal columns if the substring contains at least
one, or 0 if the substring contains no maximal columns. We eventually get the following
parallel structure:

    inter HAS C_i, 1 ≤ i ≤ m
        HEARS leftson (USES C_i, 1 ≤ i ≤ m as Cl_i, USES v_i, 1 ≤ i ≤ m as vl_i)
        HEARS rightson (USES C_i, 1 ≤ i ≤ m as Cr_i, USES v_i, 1 ≤ i ≤ m as vr_i)
        TALKS parent (SENDS C_i, v_i, 1 ≤ i ≤ m)

    (in T.inter) :
        ∀ 1 ≤ i ≤ m
            C_i ← F(vl_i, vr_i, Cl_i, Cr_i)
            v_i ← G(vl_i, vr_i)
        define F(va, vb, Ca, Cb)
            (return λz[(if z = va then Ca(z) else Ca(0)
                        ∥ if z = vb then Cb(z) else Cb(0))])
        define G(va, vb)
            (return max(va, vb))

    (in T.leaf_j, 1 ≤ j ≤ n) :
        s_j ← 0
        ∀ 1 ≤ i ≤ m
            v_i ← a_ij
            C_i ← λz[if z = v_i then s_j ← s_j + z]

    (in T.root) :
        ∀ 1 ≤ i ≤ m
            C_i ← F(vl_i, vr_i, Cl_i, Cr_i)
            v_i ← G(vl_i, vr_i)
            C_i(v_i)


In the parallel structure above we see that TRANSCONS must make certain changes.
The transmission C_i ← ⟨CGF⟩, where C_i is a communication variable, must be changed
into a mapping (in an appropriate map variable) of the point (i, ⟨CGF name⟩) → V where
V is the vector of relevant values that are enclosed. A closure application C_i(z) must
be changed into transmission of the triple (i, ⟨CGF name⟩, z) to the host. The host must
include a new process whose procedure is the procedure portion of the closure augmented to
take as input the values sent according to the previous change. A new pool must be created
to accommodate transmissions of triples to the host, and the pool that carries closures to
the recipient can be removed.

We also make the cosmetic change of expanding the function definition G(va, vb)
inline. This definition results from the method we use of inserting a D&C scheme instance
first and determining the necessary properties of the included functions later, and it will
not be inserted in later examples. This example becomes


    inter HAS C_i, 1 ≤ i ≤ m
        HEARS leftson (USES v_i, 1 ≤ i ≤ m as vl_i)
        HEARS rightson (USES v_i, 1 ≤ i ≤ m as vr_i)
        TALKS parent (SENDS v_i, 1 ≤ i ≤ m)
        TALKS leftson (SENDS wl_i, 1 ≤ i ≤ m as w_i)
        TALKS rightson (SENDS wr_i, 1 ≤ i ≤ m as w_i)
        HEARS parent (USES w_i, 1 ≤ i ≤ m)

    (in T.inter) :
        ∀ 1 ≤ i ≤ m
            M_i ← (A, vl_i, vr_i)     ; this is a map assignment that creates a "closure"
            v_i ← max(vl_i, vr_i)

    (in T.inter) :
        ∀ 1 ≤ i ≤ m
            F'(i, w_i)                ; this awaits "closure applications"
        define F'(i, ww)              ; when a "closure application" arrives...
            if M_i,1 = A then
                let z = ww, va = M_i,2, vb = M_i,3 do
                    wl_i ← (if z = va then z else 0)    ; "apply"
                    wr_i ← (if z = vb then z else 0)    ; contained closures
                    M_i ← undefined

    (in T.leaf_j, 1 ≤ j ≤ n) :
        s_j ← 0
        ∀ 1 ≤ i ≤ m
            v_i ← a_ij
            M_i ← (B, v_i)
        ∀ 1 ≤ i ≤ m
            let ww = w_i do
                if M_i,1 = B then
                    let z = ww, vv = M_i,2 do if z = vv then s_j ← s_j + z
                    M_i ← undefined

    (in T.root) :
        ∀ 1 ≤ i ≤ m
            M_i ← (A, vl_i, vr_i)
            v_i ← max(vl_i, vr_i)
            F'(i, v_i)

Here A is the name given to the CGF "λz[(if z = va then Ca(z) else Ca(0) ∥
if z = vb then Cb(z) else Cb(0))]" and B the name of "λz[if z = v_i then s_j ← s_j + z]".

The  enabling  conditions  of  the  transformation  rule  are  that  a  CGF  exists,  and  that  a 
communication  variable  exists  that  must  be  assigned  an  object  of  type  closure.  The  CGF 
is  given  an  arbitrary  name.  Creation  of  a  closure  is  replaced  by  creation  of  a  tuple  giving 
the  name  of  the  CGF  and  the  values  of  those  elements  of  the  closure  that  occur  free  in  the 
procedure  of  the  CGF.  Assignment  of  the  closure  to  a  communication  variable  is  replaced 
by  entering  the  indices  and  name  (if  necessary)  of  the  variable  into  a  map,  M,  that  maps 
it  into  the  tuple  that  represents  the  closure.  We  add  to  any  processor  that  contains  one  or 
more CGF's a process that awaits communications, decides which CGF each was based on by
using M, sets up the environment also stored in M, and performs the procedure. A closure
application  is  replaced  by  a  transmission  of  index  and  argument  information  to  the  host. 
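
The reduction rule just described can be sketched in Python. The sketch assumes a single host object, a dictionary standing in for the map M, and a table of named procedures; `Host`, `PROCEDURES`, `proc_a`, and `proc_b` are hypothetical names, and the two procedures return the values that would be forwarded rather than transmitting them.

```python
def proc_a(z, va, vb):
    """Procedure of CGF "A": compute what is forwarded to the left and right sons."""
    return (z if z == va else 0, z if z == vb else 0)


def proc_b(z, v):
    """Procedure of CGF "B": the amount a leaf adds to its running sum."""
    return z if z == v else 0


PROCEDURES = {"A": proc_a, "B": proc_b}


class Host:
    def __init__(self):
        self.M = {}  # map: index -> (CGF name, enclosed values...)

    def create(self, index, name, *env):
        # Replaces assignment of a closure to a communication variable.
        self.M[index] = (name, *env)

    def receive(self, index, z):
        # Replaces a closure application: the recipient sent (index, z).
        name, *env = self.M.pop(index)    # collation, then "M_i <- undefined"
        return PROCEDURES[name](z, *env)  # run the procedure in the stored environment
```

Because the index selects the entry in `M`, applications may arrive in any order, which is the collation role the text assigns to the index.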

If  a  CGF  is  situated  so  only  one  instance  can  be  live  at  a  time  (determined  by  data 
flow),  further  simplification  is  possible.  The  simplification  process  begins  as  above,  but 
after  the  closure  passing  and  application  is  reduced  the  index  portion  of  the  transmission 
resulting  from  the  closure  application  can  be  determined  by  flow  analysis  and  need  not  be 
included.  It  can  also  be  determined  that  the  map  will  never  contain  more  than  a  single 
element — so  map  insertion  can  be  replaced  by  assignment  of  a  variable  and  map  retrieval 
by  reference  to  the  variable. 


Below we show the parallel structure for parallel prefix summation of n-element vectors
before  and  after  closure  reduction.  Observe  that  the  possibility  of  multiple  CGF  instances 
does  not  arise  because  within  any  one  processor  a  CGF  is  only  used  once  per  computation. 


    inter HAS C
        HEARS leftson (USES C as Cl, USES v as vl)
        HEARS rightson (USES C as Cr, USES v as vr)
        TALKS parent (SENDS C, SENDS v)

    (in T.inter) :
        C ← F(vl, vr, Cl, Cr)
        v ← vl + vr
        define F(va, vb, Ca, Cb)
            (return λz[(Ca(z) ∥ Cb(va + z))])

    (in T.leaf_j, 1 ≤ j ≤ n) :
        v ← a_j
        C ← λz[a'_j ← z + a_j]

    (in T.root) :
        C ← F(vl, vr, Cl, Cr)
        C(0)


The  result  of  the  closure  removal  process  is 


    inter HAS C
        HEARS leftson (USES v as vl)
        HEARS rightson (USES v as vr)
        TALKS parent (SENDS v)
        TALKS leftson (SENDS wl as w)
        TALKS rightson (SENDS wr as w)
        HEARS parent (USES w)

    (in T.inter) :
        M ← (A, vl, vr)
        v ← vl + vr

    (in T.inter) :
        F'((), w)
        define F'(i, ww)
            if M_1 = A then
                let z = ww, va = M_2, vb = M_3 do
                    wl ← z
                    wr ← va + z
                    M ← undefined

    (in T.leaf_j, 1 ≤ j ≤ n) :
        v ← a_j
        M ← (B, a_j)
        let ww = w do
            if M_1 = B then
                let z = ww, aa = M_2 do a'_j ← z + aa
                M ← undefined

    (in T.root) :
        M ← (A, vl, vr)
        v ← vl + vr
        F'((), 0)


The fact that the index into the map can only take a single value and is therefore
redundant is immediate here, because the index is a vector of length zero! In other cases,
preexisting value flow techniques such as those of [Ken81], [SPn81] and [CRi81] would
be used to establish this fact.
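
The closure-free structure for prefix summation amounts to an upward pass that records sums followed by a downward pass that carries the augend. A minimal sequential Python sketch follows, assuming the tree is represented by nested dictionaries and the messages `v` and `w` by return values and arguments; all names are hypothetical.

```python
def reduced_prefix_sum(a):
    """Prefix sums via an upward pass (v, M) and a downward pass (w), without closures."""
    n = len(a)
    if n == 0:
        return []
    out = [0] * n

    def up(lo, hi):
        # Each node stores M, the tuple that replaces its closure, and sends v upward.
        if hi - lo == 1:
            return {"v": a[lo], "M": ("B", a[lo]), "j": lo}
        mid = (lo + hi) // 2
        left, right = up(lo, mid), up(mid, hi)
        return {"v": left["v"] + right["v"],
                "M": ("A", left["v"], right["v"]),
                "l": left, "r": right}

    def down(node, w):
        # Arrival of w plays the role of the former closure application.
        if node["M"][0] == "B":
            out[node["j"]] = w + node["M"][1]   # a'_j <- z + aa
        else:
            _name, va, _vb = node["M"]
            down(node["l"], w)                  # wl <- z
            down(node["r"], va + w)             # wr <- va + z
        node["M"] = None                        # M <- undefined

    down(up(0, n), 0)                           # the root supplies the augend 0
    return out
```

Since each node holds at most one tuple at a time, a plain variable suffices where the general rule would use a map.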

5.4  Connected Components

We now explore another specification that raises some additional issues about using D&C to
synthesize tree structures and about closure removal. In this structure use of a closure
causes  another  closure  to  be  sent,  because  the  use  of  a  closure  adds  an  element  to  a  set 
that  is  being  built  up  and  this  can  be  done  repeatedly.  We  describe  a  closure  removal 
technique  that  copes  with  this  complication,  and  we  sketch  two  possible  implementations. 
The  problem,  together  with  one  of  the  implementations,  is  described  in  [HMS84]. 



The problem is to find the connected components of a graph, given an adjacency matrix
(a matrix $A$ in which $a_{ij} = \mathit{true}$ iff node $i$ is (directly) connected to node $j$ in the graph).
The adjacency matrix will be available for input one row at a time, and a solution is
preferred that reads the rows at equal intervals.


In  this  Subsection  we  will  derive  a  tree  structure  that  solves  part  of  the  problem  and 
meets  certain  worst  case  time  constraints.  The  derived  structure  will  operate  while  the 
rows  of  the  adjacency  matrix  are  read  in. 


Formally,  we  will  assume  that  there  exists  a  source  of  rows  of  the  adjacency  matrix  that 
can  provide  one  row  at  a  time.  Each  column  will  be  read  by  its  own  processor.  Columns 
and rows have integers in the range [1, 2, ..., n] as names. When column i's processor reads
row j it receives the value true if there is a graph edge between i and j or false otherwise.
The  network  we  derive  will  then  store  the  information  in  such  a  manner  that  it  or  some 
other  network  can  identify  connected  components  of  the  graph  whose  adjacency  matrix 
was  read. 
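
For orientation, the input/output behavior can be written as a sequential Python baseline that consumes one row at a time. This uses a conventional union-find structure, which is a swapped-in technique and not the tree structure derived in this Section; node names are 0-based here and `ComponentStore` is a hypothetical name.

```python
class ComponentStore:
    """Sequential baseline: absorbs adjacency rows and answers component queries."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Follow parent links to the representative, halving the path on the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def read_row(self, j, row):
        """Absorb row j of the adjacency matrix; row[i] is true iff i and j share an edge."""
        for i, connected in enumerate(row):
            if connected:
                ri, rj = self.find(i), self.find(j)
                if ri != rj:
                    # Attach the larger root under the smaller, so every
                    # representative is the least node of its component.
                    self.parent[max(ri, rj)] = min(ri, rj)

    def label(self, i):
        """Least-numbered node in the component known so far for node i."""
        return self.find(i)
```

The work per row is not bounded uniformly in this baseline, which is one of the properties the derived structure is meant to improve on.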


The adjacency matrix contains $\Theta(n^2)$ bits, and any system capable of storing this
amount of information must obviously occupy proportional area. We would like to perform
filtration or reduction of the $n^2$ bits of information into $n \log^t n$ for constant $t$, in order to
make a more compact implementation of a circuit possible.


The  column  processor  nodes  of  the  network  must  read  elements  of  the  rows  of  the 
adjacency  matrix  at  such  a  time  (in  relation  to  the  time  other  processors  read  their  elements 
of  the  same  row)  that  the  network  will  not  confuse  elements  of  different  rows  of  the  matrix, 
and  the  net  must  build  a  representation  of  the  (partial)  connected  components  information 
in some useful manner. The representation should be compact and the computation should
be fast.

First we will derive the structure up to one important implementation decision; then
we will describe the two resulting parallel structures.


5.4.1  Derivation  of  a  Tree  Structure 


In  the  connected  components  problem,  we  do  not  necessarily  want  to  change  the  state 
of  the  leaves  of  the  tree  or  develop  a  value  at  the  root.  Instead,  we  want  to  change  some 
state  so  questions  about  connected  components  become  easier  to  answer. 

We will use the notation $CC(i)$ to denote the set of nodes in the same connected
component as the node $i$. $CC'(N)$ is a predicate indicating whether all nodes of $N$, a set of
nodes, are in a single connected component. Since the state of knowledge of the connected
components  of  a  graph  can  vary  with  time  and,  in  a  multiprocessor  system,  with  location, 
we  will  later  introduce  other  variants  of  the  CC  predicate. 

We  will  read  the  rows  of  the  adjacency  matrix  one  by  one.  After  we  have  read  all  of  the 
rows we will then engage in another computation, not described here, to put
$\mathrm{reduce}_{\min}\{j : j \in CC(i)\}$ in leaf $i$. In what follows, we will call the processing that takes place between
the  reading  of  consecutive  rows  of  the  matrix  a  phase. 


There are several solutions to the connected components problem which we reject because
they have certain undesirable features. One solution, for example, would be to have
each node record the row numbers of all rows of the adjacency matrix in which it is mentioned.
This would require a lot of storage. Another solution is to have each leaf, after
each row, find $\mathrm{reduce}_{\min}\{j : j \in CC(i)\}$ so far. The problem with this solution is that
the time between the reading of rows can vary over a wide range (see [LiV81]).

Our derivation requires a certain amount of invention. We will assume that the user
provides this by defining several intermediate predicates and by providing some information.
Two ideas are involved in our conception: the idea of a map to store the state of the
connected components so far, and the idea that the map is limited.


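We start with axioms about connected components:

$$CC'(\{e\})$$

$$CC'(A) \wedge CC'(B) \wedge A \cap B \ne \emptyset \Rightarrow CC'(A \cup B)$$

$$CC'(A) \wedge A' \subset A \Rightarrow CC'(A')$$

We observe that the following is true:

$$CC'(A) \wedge CC'(B) \wedge \exists a, b\,[a \in A \wedge b \in B \wedge CC'(\{a, b\})] \Rightarrow CC'(A \cup B)$$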
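
One way to see that this observation follows from the axioms is the short derivation below, which applies the second axiom twice. It is offered as a sketch of the reasoning and is not taken from the original.

```latex
\begin{align*}
&CC'(A),\quad a \in A,\quad CC'(\{a,b\}) && \text{given} \\
&A \cap \{a,b\} \supseteq \{a\} \ne \emptyset \;\Rightarrow\; CC'(A \cup \{a,b\}) && \text{second axiom} \\
&(A \cup \{a,b\}) \cap B \supseteq \{b\} \ne \emptyset \;\Rightarrow\; CC'(A \cup \{a,b\} \cup B) && \text{second axiom, with } CC'(B),\ b \in B \\
&A \cup \{a,b\} \cup B = A \cup B \;\Rightarrow\; CC'(A \cup B) && \text{since } a \in A,\ b \in B
\end{align*}
```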


First, we supply TRANSCONS with a divide-and-conquer formulation. In what follows,
$V$ will be a set of connected components, each of which is a set of graph nodes; $W$ is a
connected component or a subset of one; $CC$ ($CC'$, etc.) is a predicate indicating whether
a certain one of its arguments is known to be contained in a connected component; and $M$
is a mapping of nodes to nodes.


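
$$\forall W \in V\,[CC'(W)]$$

where

$$\begin{aligned}
CC'(W) \equiv\ & |W| \le 1 \\
\vee\ & \forall W_l, W_r\,\big[\, W = W_l \uplus W_r \\
& \quad \Rightarrow CC'(W_l) \\
& \quad \wedge\ CC'(W_r) \\
& \quad \wedge\ (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b \in W_r\,[CC'(\{a,b\})])\,\big]
\end{aligned}$$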


TRANSCONS  can  easily  check  that  this  meets  the  axioms,  but  the  combination  of  the 
two  halves  by  a  pair  of  arbitrary  elements,  one  from  each  half,  constitutes  a  user-supplied 
invention. 


The user must observe that the current state of $CC$ is represented by the choices of
pairs of arbitrary elements, and introduces $M$ to carry this information. Since $M$ represents
the state of knowledge of connected components, we will define a new binary predicate
$CC(M,X)$ which denotes that the mapping $M$ asserts that there exists a connected component
$C$ such that $X \subset C$. Taking a finite difference against the addition of a new set $X$
that is known to be connected, we get:


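$$\begin{aligned}
\forall X, M\ \exists M'\,\big[\ & CC(M',X) \wedge \forall W\,[CC(M,W) \Rightarrow CC(M',W)] \\
\wedge\ \forall a, b\,\big[\ & \neg CC(M,\{a,b\}) \\
& \wedge\ \forall Y, Z\,[CC(M,\{a\} \cup Y) \wedge CC(M,\{b\} \cup Z) \\
& \qquad \Rightarrow Y \cap X = \emptyset \vee Z \cap X = \emptyset] \\
& \Rightarrow \neg CC(M',\{a,b\})\big]\big]
\end{aligned}$$

where

$$\begin{aligned}
CC(M,W) \equiv\ & |W| \le 1 \\
\vee\ & \forall W_l, W_r\,\big[\, W = W_l \uplus W_r \\
& \quad \Rightarrow CC'(W_l) \\
& \quad \wedge\ CC'(W_r) \\
& \quad \wedge\ (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b \in W_r\,[M(a,b)])\,\big]
\end{aligned}$$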

The long conjunct on the second through fifth lines states simply that no connected
components are implied by $M'$ that aren't either implied by $M$ or forced by $X$.

We invite the user to make another critical observation, namely that $\forall W\,[CC(M,W) \Rightarrow
CC(M',W)]$ can be satisfied by $\forall a,b\,[M(a,b) \Rightarrow M'(a,b)]$. (S)he can further observe from
the original axioms that $CC(\{a,b\}) \wedge a \in A \wedge CC(\{b,c\}) \wedge c \in C \Rightarrow CC(A \cup C)$. We can
thus liberalize the condition on $M$ in $CC$ as follows:


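$$\forall X, M\ \exists M'\,[CC(M',X) \wedge M(a,b) \Rightarrow M'(a,b) \wedge \ldots]$$

where

$$\begin{aligned}
CC(M,W) \equiv\ & |W| \le 1 \\
\vee\ & \forall W_l, W_r\,\big[\, W = W_l \uplus W_r \\
& \quad \Rightarrow CC'(W_l) \\
& \quad \wedge\ CC'(W_r) \\
& \quad \wedge\ (W_l \ne \emptyset \wedge W_r \ne \emptyset \\
& \qquad \Rightarrow \exists a \in W_l, b\,[M(a,b) \wedge (b \in W_r \vee CC(M,\{b\} \cup W_r))])\,\big]
\end{aligned}$$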


This  specification  is  suboptimal  because  it  allows  M  to  be  multivalued.  We  will  examine 
this  solution  in  detail  and  see  how  it  translates  into  a  tree  that  maintains  M  in  internal 
state.  We  will  then  see  what  can  be  done  to  improve  this. 

We  therefore  make  a  change  in  CC  to  express  the  fact  that  the  divisions  will  always  be 
made  in  the  same  manner,  and  that  M  need  only  be  defined  for  one  set  of  subsets  of  the 
universe.  This  change  is  the  addition  of  a  parameter,  a  subset  of  the  universe  (of  nodes 
in  the  graph  whose  connected  components  we  are  seeking).  Later  we  will  repair  another 
deficiency  of  this  specification:  that  it  allows  M  to  be  larger  than  we  would  like. 

$M$ will be made a ternary rather than a binary relation. $M(S,a,b)$ is true if $a$ connects
to $b$ relative to $S$. The purpose of this is to limit the size of $M$.


The  new  parameter  to  CC  ranges  over  particular  subsets  of  the  universe.  It  has  two 
roles: it tells what version of $M$ to use, and it restricts acceptable solutions to $CC$.
$CC(S,M,X)$ is true only if there exist elements of $M(S',x,y)$, where $S' \subset S$, that show
that $X$ is connected. This is a stronger condition than the original $CC(M,W)$.

In addition to making $M$ and $CC$ relative to a given set $S$, we will introduce functions
$L$ and $R$ such that $L(S) \uplus R(S) = S$. The motivation for this is that we are trying to
establish a tree structure of sets and subsets that together comprise those sets; $L$ and $R$ are
a  Skolemization  of  the  assertion  that  there  is  a  way  of  dividing  the  universe,  each  of  its 
two  subsets,  etc.  that  meets  further  conditions.  The  domain  and  range  of  L  and  R  must 
meet  domain(L)  =  domain(iZ)  =  range(L)  UraDge(^)  U{£^}  -  U. 

To  formalize  the  new  parameter  of  CC  we  write: 


W,  X,M,We  V3MVo,  b[CC{M,  W)  A  CC{M^,  X)  A  M(S,  o,  6)  =►  o,  6)  A  . . .] 

where 

CC{M,W)  s  CC[U,M,W) 
and 

CC{S,M,W)=  \W\<1 

V 

WL,R 

[  L{S)  W  f?(5)  =  S 

Aivi  = 

=►  CC{LiS),M,W,) 

/\CC{RiS),M,Wr) 

A(  IV,  ^0AIV, 

=>3o€lV,,6€lV,lM(S,o,6)l)] 


In what follows we will use the locution $P_S$ to denote "the processor responsible for the
set $S$".


A closure is needed here to satisfy $CC(L(S),M,W_l)$ and $CC(R(S),M,W_r)$. This
closure requires no arguments, because the processors for $L(S)$ resp. $R(S)$ have all of the
information they need to do their work. All elements of $W_l$ resp. $W_r$ are in the subtree
headed by processor $L(S)$ resp. $R(S)$.

Application  of  the  closure  serves  notice  on  descendant  processors  that  they  should  be 
ready  to  add  to  their  maps  in  a  manner  that  comes  from  the  fifth  conjunct  in  the  large 
expression.  The  need  will  be  described  below. 

Now we can continue the synthesis process by applying transformation rules to
satisfy($\forall V, X, M, W \in V\ \exists M'\ \forall a,b\,[CC(M,W) \wedge CC(M',X) \wedge M(S,a,b) \Rightarrow M'(S,a,b)]$).
We soon find ourselves transforming satisfy($M'(S,a,b)$).

Suppose we add an additional condition, $M(S,a,b) \wedge M(S,a,c) \Rightarrow b = c$. After we
have replaced occurrences of $M$ by occurrences of $M'$ (as a constraint propagator would
do when analyzing "$CC(S,M',W)$") and imposed this condition, we get the following:


$$\wedge\ (W_l \ne \emptyset \wedge W_r \ne \emptyset \Rightarrow \exists a \in W_l, b \in W_r\,[M'(S,a,b) \wedge \forall c\,(M'(S,a,c) \Rightarrow c = b)])$$

We can not satisfy this last clause (the implicand) when $\exists c \notin W_r\,[M'(S,a,c)]$ because
this conflicts with $M(S,a,b) \Rightarrow M'(S,a,b)$. $M'(S,a,c)$ is required by $M(S,a,c)$ and
forbidden by the requirement that $\exists c' \in W_r\,[M'(S,a,c')]$.

However, we have $M(S,a,c) \Rightarrow CC(S,\{a,c\})$ and $CC(R(S),\{c\} \cup W_r) \wedge
CC(S,\{a,c\}) \Rightarrow CC(S,\{a\} \cup W_r)$.


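We therefore use $\vee$ to expose the fact that there are alternatives:

$$\begin{aligned}
\wedge\ (\,& W_l \ne \emptyset \wedge W_r \ne \emptyset \\
& \Rightarrow \exists a \in W_l, b \in W_r\,[\,(M'(S,a,b) \wedge \nexists c \ne b\,[M(S,a,c)]) \\
& \qquad \vee\ \exists c\,[M'(S,a,c) \wedge CC(S,M',\{c\} \cup W_r)]\,])
\end{aligned}$$

As it is known that $M(S,a,x)$ can only be asserted by the above, an inductive proof is
available that $c \in R(S)$. This can therefore be replaced by

$$\begin{aligned}
\wedge\ (\,& W_l \ne \emptyset \wedge W_r \ne \emptyset \\
& \Rightarrow \exists a \in W_l, b \in W_r\,[\,(M'(S,a,b) \wedge \nexists c \ne b\,[M(S,a,c)]) \\
& \qquad \vee\ \exists c\,[M'(S,a,c) \wedge CC(R(S),M',\{c\} \cup W_r)]\,])
\end{aligned}$$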

This gives two alternative ways to satisfy the specification. We can satisfy $M'(S,a,b)$
if $M(S,a,b) \vee \nexists c\,[M(S,a,c)]$. Satisfying the other disjunct is harder than this because it
requires satisfaction of a predicate containing $R(S)$, so we prefer the first disjunct when
it can be satisfied. If we can't use the first disjunct, then we know $\exists c\,[M(S,a,c)]$ so we
have only to satisfy $CC(R(S),M',\{c\} \cup W_r)$ for that $c$. This leads to:


satisfy(∃a ∈ W_l, b ∈ W_r
          [ (M'(S,a,b) ∧ ∄c≠b [M(S,a,c)])
          ∨ ∃c [M'(S,a,c) ∧ CC(R(S), M', {c} ∪ W_r)] ])

bind a to arb(W_l), b to arb(W_r) in
    if M'(S,a,b) ∨ ∄c [M(S,a,c)] then satisfy(M'(S,a,b))
    else satisfy(M(S,a,c) ⇒ CC(R(S), M', {c} ∪ W_r))


5.4.1.1  Closure Requirements

In order to be able to satisfy $CC(R(S),M',\{c\} \cup W_r)$ we will need a closure. This
closure requires an argument because only $P_S$ knows $c$. Since we have $CC(R(S),M',W_r)$
we will need a closure $CC(R(S),M',W_r) \Rightarrow \mathit{action}(C_r(z)) = CC(R(S),M',\{c\} \cup W_r)$.

Expanding $CC(\ldots)$ and renaming the variables with a prime (i.e., $v$ becomes $v'$),
we need $W_l'$, $W_r'$, $CC(L(S),M',W_l')$, $CC(R(S),M',W_r')$, and $(W_l' \neq \emptyset \land \ldots \lor \exists c\,[\ldots \land
CC(R(S),M',\{c\} \cup W_r')])$. Since we know that $W_r' = W_r \lor W_r' = \{x\} \cup W_r$ for some
$x \in R(S)$ and similarly for $W_l'$, we see that to establish $CC(S,M',\{c\} \cup W)$ we establish
(and need a closure to so establish) $CC(L(S),M',\{c\} \cup W_l)$ (or $\ldots R(S) \ldots W_r$).

Each node $P_S$ applies $C_l$ once or not at all, and $C_r$ zero, one or two times. We therefore
need two features in order to satisfy our requirement that each closure be applied exactly
once: we need to have a distinguishable null message to apply each closure to when a node
knows that it will not be needing it, and we need to have the application of a closure
cause the host to send another closure that has the necessary capability. We index the
communication variable to allow successive closures to be distinguished.

There  is  an  important  fact  that  must  be  noted  in  this  case.  Because  one  result  of  using 
a  closure  is  the  generation  of  a  similar  one,  a  CGF  for  a  closure  must  be  included  in  the 
procedure  attached  to  that  CGF.  This  requires  the  procedure  to  call  itself  recursively. 

The final version of the CGF that adds an element to $P_S$'s connected component is:


CC-add-element(z) ≡
  case
    z = nil :
        C_l(nil) || C_r(nil)
    z ∈ L(S) ∧ W_l ≠ ∅ :
        W_l ← z ∪ W_l
        C_l(z) || C ← λz[CC-add-element(z)]
    z ∈ R(S) ∧ W_r ≠ ∅ :
        W_r ← z ∪ W_r
        C_r(z) || C ← λz[CC-add-element(z)]
    z ∈ L(S) ∧ W_l = ∅ ∧ W_r = ∅ :
        W_l ← z ∪ W_l
        C ← λz[CC-add-element(z)]
    z ∈ R(S) ∧ W_l = ∅ ∧ W_r = ∅ :
        W_r ← z ∪ W_r
        C ← λz[CC-add-element(z)]
    z ∈ L(S) ∧ W_l = ∅ ∧ W_r ≠ ∅ :
        W_l ← z ∪ W_l
        Map-add-or-call-Cr()
        C ← λz[CC-add-element(z)]
    z ∈ R(S) ∧ W_l ≠ ∅ ∧ W_r = ∅ :
        W_r ← z ∪ W_r
        Map-add-or-call-Cr()
        C ← λz[CC-add-element(z)]

Map-add-or-call-Cr() ≡
  let a = arb W_l, b = arb W_r do
    if M(S,a,b) ∨ ∄c [M(S,a,c)]
    then M'(S,a,b)
    else let c = M(S,a) do C_r(c)
         C ← λz[CC-add-element(z)]
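
The helper Map-add-or-call-Cr can be read as a map update with one escape route. The Python sketch below is only an illustration under assumptions that are not in the original: the map M for a fixed S is modelled as a dictionary from left-subtree leaves to right-subtree leaves, `arb` is modelled by `min` so that the example is deterministic, and `send_right` is a hypothetical stand-in for applying the closure C_r.

```python
def map_add_or_call_cr(M, W_l, W_r, send_right):
    """Sketch of Map-add-or-call-Cr for one tree node.

    M          -- dict: left leaf -> right leaf (a missing key means "unmapped")
    W_l, W_r   -- non-empty sets of active leaves in the left / right subtree
    send_right -- hypothetical callback standing in for the closure C_r
    Returns the leaf handed to send_right, or None if the map was extended.
    """
    a = min(W_l)  # 'arb': any element would do; min keeps the sketch deterministic
    b = min(W_r)
    c = M.get(a)
    if c is None or c == b:
        # First disjunct: M(S,a,b) already holds, or a is unmapped, so assert M'(S,a,b).
        M[a] = b
        return None
    # Second disjunct: a already has a partner c, so c must join the right component.
    send_right(c)
    return c
```

With an empty map, a call with `W_l = {1, 2}` and `W_r = {5}` records `1 -> 5`; a later call with `W_l = {1}` and `W_r = {7}` leaves the map alone and passes `5` to `send_right`.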


We will now explore the issues involved in transforming this specification containing
closures into an equivalent one with downward communication.

5.4.1.2  Closure Removal Issues


Only CC-add-element can create a closure C: there is only one call to this routine
outside of itself, and it is tail recursive. This implies that there is no way two CGF-instances
can be live at once, allowing downward communication to consist of only the argument of
the closure application.

As in the previous case we eliminate the closure by replacing the application with a
setting of a communication variable and making the body of the CGF a piece of code that
awaits such settings. The tail recursion, which occurs unless the value of the communication
variable is nil, is replaced by a loop that tests for that terminating condition. The
resulting code with closures eliminated is displayed below.


CC-add-element(z) ≡
  case
    z = nil :
        z_l ← nil; z_r ← nil
    z ∈ L(S) ∧ W_l ≠ ∅ :
        W_l ← z ∪ W_l
        z_l ← z
    z ∈ R(S) ∧ W_r ≠ ∅ :
        W_r ← z ∪ W_r
        z_r ← z
    z ∈ L(S) ∧ W_l = ∅ ∧ W_r = ∅ :
        W_l ← z ∪ W_l
    z ∈ R(S) ∧ W_l = ∅ ∧ W_r = ∅ :
        W_r ← z ∪ W_r
    z ∈ L(S) ∧ W_l = ∅ ∧ W_r ≠ ∅ :
        W_l ← z ∪ W_l
        Map-add-or-call-Cr()
    z ∈ R(S) ∧ W_l ≠ ∅ ∧ W_r = ∅ :
        W_r ← z ∪ W_r
        Map-add-or-call-Cr()

Map-add-or-call-Cr() ≡
  let a = arb W_l, b = arb W_r do
    if M(S,a,b) ∨ ∄c [M(S,a,c)]
    then M'(S,a,b)
    else let c = M(S,a) do z_r ← c

(in T.inter) :
  until xx = nil do
    await defined(z)
    xx ← z
    CC-add-element(xx)

This  completes  the  synthesis  of  the  downward  communication  portion  of  a  parallel 
structure  for  collecting  connected  component  information  from  a  series  of  rows  of  an  adja¬ 
cency  matrix. 
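
The closure-removal step can be illustrated with a small Python sketch. It is not the synthesized code: `handle` is a hypothetical stand-in for the body of CC-add-element, `None` plays the role of nil, and a `queue.Queue` plays the role of the communication variable. The first function keeps the closure style, where each application returns the closure for the next message; the second is the loop that awaits settings of the communication variable.

```python
from queue import Queue


def closure_chain(handle, messages):
    """Closure style: applying a closure yields the closure for the next message."""
    def make_closure():
        def apply(z):
            handle(z)
            # A nil message ends the chain; otherwise a similar closure is produced.
            return None if z is None else make_closure()
        return apply

    closure = make_closure()
    for z in messages:
        if closure is None:
            break
        closure = closure(z)


def await_loop(handle, inbox):
    """Loop style: the tail recursion becomes a loop that stops after nil is handled."""
    while True:
        z = inbox.get()  # corresponds to 'await defined(z)'
        handle(z)
        if z is None:
            break


def demo():
    messages = [3, 5, None]
    log_closures, log_loop = [], []
    closure_chain(log_closures.append, messages)
    inbox = Queue()
    for z in messages:
        inbox.put(z)
    await_loop(log_loop.append, inbox)
    return log_closures, log_loop
```

Both styles process the same messages in the same order, which is the property the transformation relies on: at most one closure instance is live at a time, so only the argument needs to travel downward.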

5.4.2  Alternative  Data  Structures 

It is now necessary to consider the options for storing $M$. The type of $M$ is $T \times U \to U$,
where $U$ is the set of nodes in the graph whose connected components are being determined,
and $T$ is a set of sets such that $U \in T \land (S \in T \land |S| > 1 \Rightarrow R(S) \in T \land L(S) \in T)$. The
genesis of $T$ is such that each intermediate node plus the root of the tree has as its set of
leaves some element of $T$ if each element of $U$ is represented by a leaf.

Because of the type of $M$, we have four simple options to represent the mapping: We
can represent it in one processor's memory, in the memory of one processor per element of
$T$, in one processor per element of $U$, or in one processor per element of $T \times U$. The first
possibility would lack concurrency and the last would require too many processors. The
remaining possibilities include using interior nodes of the tree (corresponding to elements
of $T$) or leaves (corresponding to elements of $U$) as the repository for information about
parts of $M$.

Inspection of the specification yields the information that the tree node representing
a set $S$ must be able to answer questions of the form $\exists c\,[M(S,a,c) \land c \neq b]$ and
find $c$ such that $M(S,a,c)$, and must be able to satisfy($M(S,a,b)$). This requires either
keeping $M(S,x,y)$ in $S$'s node or providing that node with appropriate closures.

That node must also be able to satisfy($CC(L(S),M',W_l)$), to satisfy($CC(R(S),M',W_r)$), and to
satisfy($CC(R(S),M',c \cup W_r)$) given $c \in R(S) \land CC(R(S),M',W_r)$. This requires
another handful of closures.

Since closures to satisfy($CC(L(S),M',W_l)$) and satisfy($CC(R(S),M',W_r)$) would
require only information available below $L(S)$ and $R(S)$ respectively, and since there is
no control flow path by which the need to satisfy these two predicates would be evaded,
we observe that each interior node requires $a$ = arb $W_l$, $b$ = arb $W_r$, and the closure
$\lambda z\,[\text{satisfy}(M'(R(S),a,z))]$.

We  are  building  a  map  that  maps  at  most  one  leaf  of  the  right  subtree  to  each  leaf 
of  the  left  subtree.  As  described,  the  map  is  stored  in  the  node  that  has  the  appropriate 
subtrees.  However,  other  alternatives  are  possible. 

There are three natural places to store the assertion $M(S,a,b)$. They are the node
whose subtree's leaves are $S$, leaf $a$, and leaf $b$. If the information is stored in $S$, there
must be one cell for each leaf of the left subtree, and if the information is stored in $a$ then
there must be one cell for each ancestor representing $S$. If the information is stored in $b$,
we have no limit (beyond the size of the problem) for the amount of storage that must be
provided in $b$. We therefore reject this alternative.

Storing $M$ in the node heading $S$ minimizes communication (information is where it is
used), making the algorithm take $O(\log n)$ steps. These steps are not constant-time steps,
however, because they require access to a random access memory whose size is $O(n)$, itself
an $O(\log n)$ operation. The algorithm therefore has an $O(\log^2 n)$ running time. It should
be noted, however, that in current technology the constant factor is very small compared
to constant factors on $\log n$ terms until the problem instance becomes very large.

The result could be transformed to place the fact of $M(S,a,b)$ in $a$. This would result
in a different algorithm, one that requires the leaves to supply closures to access and modify
the map.


There is an interesting problem here. We would prefer that the leaves not have to know
about elements of $T$. It would therefore be necessary to have the $M$ table within each leaf
organized in a certain order and to have use made of this information in that fixed order.
This requires that a "flame front" of subtree handling be arranged such that initially the
root is the tree for which we are trying to associate pairs of elements, and on succeeding
subphases the level at which we are trying to match descends. This algorithm has an
$O(\log^2 n)$ execution time because there are $\log n$ subphases, each of which is $O(\log n)$.

We prefer the former data structure, in which $M(S,a,b)$ is represented in $S$, because
the issue described in the previous paragraph does not arise. That structure will always
be available to us unless the size of a change to $M$ is proportional to the size of $S$, and
this cannot be, because the combination step of the divide and conquer scheme must be
fast for the specification to parallelize well in a tree structure.


Below  we  describe  in  detail  the  algorithm  that  results  from  this  decision,  followed 
by  that  which  results  from  storing  the  map  in  the  leaves  and  a  brief  description  of  that 
synthesis  path. 


5.4.3  Results  of  Storing  the  Map  in  Internal  Nodes 


In this structure we have a cell in each subtree root (i.e., each internal node) for each
of that subtree's leaves. This structure requires $n \log n$ cells, one for each leaf/ancestor
pair, and it should lay out nicely in VLSI because the bigger nodes are closer to the root of
the tree. Each cell must accommodate one of $n$ values, requiring $\log n$ bits. This imposes a
total memory requirement of $n \log^2 n$ bits. Each internal node contains a map which maps
names of leaves of the left subtree into either nil or names of leaves of the right subtree.
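
The storage figures can be checked with a few lines of arithmetic. The Python sketch below assumes, for illustration only, that n is a power of two so that the tree is perfectly balanced; the function names are hypothetical. It counts one cell per leaf/ancestor pair level by level and multiplies by the number of bits needed to name one of n leaves.

```python
def map_cells(n):
    """Cells needed when every internal node keeps one cell per leaf below it."""
    assert n >= 2 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    total = 0
    nodes_at_level = 1   # start at the root
    leaves_below = n     # the root has all n leaves below it
    while leaves_below > 1:
        total += nodes_at_level * leaves_below  # each level contributes n cells
        nodes_at_level *= 2
        leaves_below //= 2
    return total


def map_bits(n):
    """Total bits: each cell names one of n leaves, i.e. log2(n) bits per cell."""
    bits_per_cell = (n - 1).bit_length()
    return map_cells(n) * bits_per_cell
```

For n = 8 there are three levels of internal nodes, each contributing 8 cells, so `map_cells(8)` is 24 = 8 log 8 and `map_bits(8)` is 72 = 8 log² 8.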

The  overall  view  of  the  algorithm  is  as  follows: 

Each leaf sends its parent its name if it is active, or nil. Each intermediate or root node
sends its parent either the name of any active node it receives from its children, or nil if
it receives nil from both children. If it receives two names it chooses arbitrarily. Each
intermediate node also remembers what it received from its children.

In addition, suppose it receives a name from both children. There are two cases: If the
name from the left node maps (in the node's internal mapping from leaves to values) into
nil, make it map into the name from the right node and do nothing else. If it maps into
(say) i, send awaken i to the right child and do nothing else.
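
The upward step of a single internal node, as just described, can be sketched in Python. This is an illustration under assumptions not in the original: the node's map is a dictionary, `None` plays the role of nil, the "arbitrary" choice between two names is fixed to the left one, and the function name is hypothetical.

```python
def internal_step(mapping, left, right):
    """One upward step of an internal node.

    mapping -- dict: left-subtree leaf -> right-subtree leaf (missing key = nil)
    left, right -- names received from the two children, or None for nil
    Returns (name_for_parent, awaken_target); either may be None.
    """
    # Forward any active name; when both are present the left one is chosen.
    up = left if left is not None else right
    awaken = None
    if left is not None and right is not None:
        partner = mapping.get(left)
        if partner is None:
            mapping[left] = right   # first case: extend the map
        else:
            awaken = partner        # second case: awaken the recorded partner
    return up, awaken
```

Starting from an empty map, `internal_step(m, "a", "x")` records `a -> x` and returns `("a", None)`; a later `internal_step(m, "a", "y")` returns `("a", "x")`, meaning that an awaken message for `x` goes to the right child.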

If an intermediate node receives an awaken i message from its parent, it checks to see
whether i is in its right or left subtree. It also checks to see what it has received before.

If a node receives an awaken i message and has already received a name from i's subtree
it sends an awaken i message to the appropriate child. If it hasn't received one it considers
itself to have so received. (This can have one of three effects: modification of reaction to
further awaken messages, lookup of i in the local mapping if i belongs in the left subtree,
or lookup of a previously received name in the local mapping if i was in the right subtree
and that previous name was in the left. If a lookup is performed we then either extend the
mapping or create a new awaken message.)


The root sends its children a termination message when it's done. Intermediate nodes
relay such messages. Each leaf reads the next line of the adjacency matrix when it receives
this termination, and starts a new cycle.

The "wrapup", where each leaf gets the name of a representative of its connected
component, is also faster under this arrangement. The root sends its right child its
correspondences one by one, followed by "end". When a node receives a → b it replaces b → c
(if it has one) by a → c. This is not done for b → nil. Intermediate nodes also relay
correspondences received from parents. When an intermediate node receives "end" from
its parent, it dumps its own correspondences as they now stand and then sends its own
"end". A leaf node initializes a cell to its own name, and the leaf named b changes this value
to a if it receives a → b. A leaf node knows it has the right value when it sees "end".

To derive this structure we make a different decision when creating the closures required
by the synthesis. Rather than assuming that $P_S$ needs no closure to satisfy($M(S,a,b)$)
or to test $\exists c\,[M(S,a,c)]$ we assume that closures necessary for either of these functions are
available from $P_{L(S)}$, and ultimately from $P_a$. By a series of steps similar to the ones taken
to synthesize the previous algorithm, we obtain a structure in which each leaf has a cell
for each of its ancestors. This structure is described in [HMS84].

In this structure there are potentially $\Theta(\log^2 n)$ communications, because there are
potentially $\log n$ phases in which it is learned that a left subset must link with a right
subset, and each such phase requires $\Theta(\log n)$ communications to operate on the map
data.

The parallel structure is (informally) as follows:

There is a balanced binary tree of processors where each leaf of the tree corresponds
to a node of the graph. For simplicity of exposition we will write the following as if the
leaves were, rather than "corresponded to", the nodes. For simplicity we will assume that
the entire adjacency matrix is supplied, rather than only a triangular matrix.


The leaf nodes build approximations to the answer as the algorithm grinds on. Each
leaf node has one memory cell for each ancestor. Consider the memory cell for ancestor
$a$ in leaf $l_i$. It is initialized to the distinguished value nil, and during the course of the
algorithm it will come to contain some $j$ such that the least common ancestor (LCA) of $j$
and $i$ is $a$, and $i$ and $j$ are known to be in the same connected component, provided that
some such $j$ exists.


The  algorithm  works  as  follows:  A  leaf  is  called  active  if  its  bit  is  set  in  the  current  row 
of  the  adjacency  matrix.  After  a  row  is  read  in,  information  is  passed  upward  so  each  node 
can  determine  whether  both  of  its  subtrees  contain  active  leaves,  and  what  the  highest 
and  lowest  active  leaves  are  for  such  nodes.  Information  is  then  passed  downward  so  each 
internal  (or  root)  node  can  determine  whether  it  is  the  top  such  node.  That  node  sends  a 
message  to  those  two  extreme  nodes  informing  them  of  each  other’s  identity. 


The following cycle is repeated:

TU computes spans, TD distributes span information and keeps track of the topness
of nodes.

TV istype TREE(i), i ∈ (1,...,n) size n

    root   HAS minact, maxact, topp, listop, ristop
           HEARS leftson (USES upmin)
           HEARS rightson (USES upmax)
           TALKS leftson (SENDS listop)
           TALKS rightson (SENDS ristop)
    inter  HAS minact, maxact, topp, listop, ristop
           HEARS leftson (USES upmin)
           HEARS rightson (USES upmax)
           TALKS leftson (SENDS listop)
           TALKS rightson (SENDS ristop)
           TALKS parent (SENDS upmin)
                        (SENDS upmax)
    leaf   HAS active, ccmate_ij, j ∈ ancestors
           HEARS INPUT (USES a_ij, j ∈ (1,...,n))
           TALKS parent (SENDS upmin)
                        (SENDS upmax)


(in TU.leaf_i)
    ∀j ∈ ancestors
        ccmate_ij ← nil
    ∀j ∈ (1,...,n)
        temp ← a_ij
        upmin ← upmax ← if temp then i else nil
        dmin ← downmin
        dmax ← downmax
        other ← nil
        pivot ← pivot
        if dmin = i then other ← dmax
        if dmax = i then other ← dmin
        if other ≠ nil then
            if ccmate_i,pivot = nil
            then awaken ← nil; ccmate_i,pivot ← other
            else awaken ← ccmate_i,pivot

(in TU.inter)
    ;; first establish my status
    (lrangel, lrangeh) ← lrange
    (rrangel, rrangeh) ← rrange
    range ← (min(lrangel, lrangeh), max(rrangel, rrangeh))
    livep ← range_1 ∧ range_2
    ;; This is a once-per-minor-phase activity
    while dstatus ≠ 'dead

(in TU.root)
    (lrangel, lrangeh) ← lrange
    (rrangel, rrangeh) ← rrange
    range ← (min(lrangel, lrangeh), max(rrangel, rrangeh))
    livep ← range_1 ∧ range_2
    while dstatus ≠ 'dead

(in TD.inter)
    if pstatus ∈ {'live, 'top}
    then status ← 'live
         range ← prange
    elseif livep then status ← 'top
         range ← range
    else status ← 'dead
    while status ≠ 'dead

(in TD.root)
    if livep then status ← 'top
         range ← range
    else status ← 'dead

In  each  minor  phase  the  leaves  send  up  awakening  information  and  get  back  a  packet 
of  information  very  similar  to  the  one  they  received  in  the  beginning. 

Each leaf, when it dies by finding that the node just above it is dead, sends up an "init"
message. When every node has done so, the root broadcasts its own form of "init" and the
leaves can read from the I/O processor that contains the next row of the adjacency matrix.

Here we describe the overall behavior of the algorithm, considering the parallel structure
to be a single entity that can do things sequentially. To actually have this effect, there
are synchronization problems, and below we describe a node's eye view of the situation,
including the work that each node has to do to coordinate with its neighbors.


Initialize:      Have each node read in its element of the adjacency matrix. Those nodes
                 reading a "1" in the adjacency matrix turn themselves on, as does the
                 node whose index corresponds to that of the row of the matrix. Mark the
                 root as the "focus".

Survey:          Every leaf sends information telling whether it is awake. Using this
                 information, the internal nodes below a focus find out which of them has awake
                 descendants in each of the two trees ("has two active subtrees"). This is
                 a straightforward "up" problem.

New root:        The highest node with two active subtrees is determined. This is the LCA
                 of active leaves. It becomes the new focus, nodes between it and leaves
                 become "active", and nodes above it but below and including the old focus
                 become "dead".

Tournament:      Select an arbitrary active leaf node in each of each focus's two subtrees.
                 Report the identities of the two leaves to their focus. Simultaneously report
                 the identity of the focus and of the other leaf to each of the two leaves.

Lookup:          The leaves contain a variable mapping that maps their ancestors into a leaf
                 index or the distinguished value nil. The leaves look up the focus in this
                 mapping. If it is nil, they store the other leaf's identity. If the left leaf's
                 value is not nil, report the value to its focus.

New awakening:   If its left tree reports a leaf ID per Lookup, a focus sends a message to
                 that leaf commanding it to awaken.

Refocus:         Each focus sends a message to those of its children that are not leaves
                 telling them to become new focuses, and dies.

Repeat (maybe):  If not all leaves have a dead parent, go back to New root.
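
The Survey and New root steps amount to finding the highest node with two active subtrees. The Python sketch below simulates that sequentially. The layout is an assumption made for illustration and is not taken from the original: the tree is a complete binary tree in heap order, with nodes 1 to 2n-1, leaves n to 2n-1, and n a power of two. The function name is hypothetical.

```python
def find_focus(active_leaves, n):
    """Return the heap index of the highest node with two active subtrees.

    active_leaves -- iterable of leaf numbers in range(n)
    Returns None if no leaf is active, and the leaf's own node index if
    exactly one leaf is active.
    """
    active = [False] * (2 * n)
    for i in active_leaves:
        active[n + i] = True
    # Survey: an "up" pass marking every node that has an active descendant.
    for v in range(n - 1, 0, -1):
        active[v] = active[2 * v] or active[2 * v + 1]
    if not active[1]:
        return None
    # New root: descend while only one subtree is active.
    v = 1
    while v < n and not (active[2 * v] and active[2 * v + 1]):
        v = 2 * v if active[2 * v] else 2 * v + 1
    return v
```

With n = 8, active leaves 2 and 3 sit at nodes 10 and 11, whose least common ancestor is node 5, while active leaves 0 and 7 have the root (node 1) as their focus.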


As can be seen above, the algorithm has several subphases, as the focus moves down
towards the leaves, and each of these subphases has several sub-sub-phases: Survey, New
root, Tournament, Lookup, New Awakening, Refocus, and Repeat (maybe). Internal nodes
of the tree have the status dead, focus or live, and leaf nodes either have status awake or
asleep. The behavior of each node during each sub-sub-phase will be described.

Survey: Leaves tell parents whether they are active. Intermediate nodes: (live and focus
only) Get status from descendants. Remember and (live only) tell parent how many
subtrees have one or more active subtrees. Remember which subtree was active if exactly one
was.

New root: If a focus has two active subtrees it tells its left (resp. right) child "focus above
you = (node), you are left (resp. right)". If it has one, tell that one "focus at or below you"
and the other "die". It is impossible for a focus to have no active subtree.

Intermediate nodes below a focus (i.e., those nodes that are live) listen to their parents.
If one hears "die" it dies. If one hears "focus above = xxx ..." it relays the message and
becomes or remains live. If one hears "focus at or below" it acts as in the paragraph above.

Leaves that receive a "die" message send their parent an "I died" message and prepare to
read the next line of the adjacency matrix.

Active  leaf  nodes  record  the  name  of  their  focus. 

Tournament and Lookup: Each leaf contains a mapping M relating the name of each of its
ancestors to either nil or the index of a leaf. A sleeping leaf node sends nil to its parent. An
awake leaf node i that receives a "focus above you = (node), you are left" message sends to
its parent either (empty, i) if M(node) = nil, or (loaded, M(node)). If it receives "focus
above you = (node), you are right", it sends i to its parent.

A live internal node which receives nil from both children sends the same to its parent;
one that receives something else from one child sends that value to its parent, and one that
receives non-nil values from both children sends either to its parent. The correctness of
the algorithm does not depend on this choice, which can be random, pseudo-random, or
consistent.

Each focus receives a message from each child. Say the right child's message is j. If the
left child's message is (empty, i), then (record, focus, i, j) is sent to the left child and nil
is sent to the right. If the left child's message is (loaded, i), then nil is sent to the left
child and (awaken, i) is sent to the right.
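
The relay rule for live internal nodes and the decision made by a focus can be written as two small functions. This Python sketch is illustrative only: messages are modelled as tuples, `None` plays the role of nil, the "consistent" variant of the arbitrary choice is used (prefer the left child), and the function names are hypothetical. It assumes the focus receives non-nil messages from both children, which holds because a focus has two active subtrees.

```python
def relay_up(left_msg, right_msg):
    """Live internal node: forward nil, the single non-nil value, or either one."""
    if left_msg is None:
        return right_msg
    return left_msg  # consistent choice when both children report a value


def focus_step(focus, left_msg, right_msg):
    """Focus node: returns (message_to_left_child, message_to_right_child)."""
    j = right_msg          # identity of the leaf chosen in the right subtree
    tag, i = left_msg      # ("empty", i) or ("loaded", i) from the left subtree
    if tag == "empty":
        # The left leaf i has no partner under this focus yet: record j as its partner.
        return ("record", focus, i, j), None
    # The left leaf already has a partner i under this focus: awaken that partner.
    return None, ("awaken", i)
```

For example, `focus_step("F", ("empty", 2), 6)` yields `(("record", "F", 2, 6), None)`, and `focus_step("F", ("loaded", 4), 6)` yields `(None, ("awaken", 4))`.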

Lookup and New Awakening: Internal nodes relay parents' messages to their children.

If leaf node i receives (record, focus, i, j) it sets M(focus) ← j. If it receives (awaken, i)
it awakens. (If i doesn't match, it does nothing.)

Refocus: Each focus sends its children a "become a focus" message and dies. A live node
receiving such a message from its parent changes its status to "focus". A leaf receiving
such a message from its parent sends the latter an "I died" message.

Repeat (maybe): At all times, a node receiving two "I died" messages sends one upward. If
a node receives a "become a focus" message it sends its children a "begin survey" message.
Live intermediate nodes relay such a message, and leaf nodes receiving a "begin survey"
message proceed as in Survey.


Chapter  6 


Use  of  Additional  Techniques  -  Binary  Addition 


There are three important classes of circuits for the addition of integers represented as
vectors of "bits" in radix 2. These occupy three positions on a spectrum of cost/speed
tradeoffs. The fastest and most expensive circuit is a carry-look-ahead adder, described in
[Hwa79], which performs addition of two n-bit integers in $O(\log n)$ time using $O(n)$ logic
elements. An intermediate circuit, the ripple carry adder, takes linear time and also uses
a linear, but smaller, amount of logic. The slowest and cheapest circuit is a serial adder
which uses a small constant amount of logic to perform additions in linear time (with a
larger proportionality constant than that of a ripple-carry adder).

There  are  three  reasons  for  studying  the  synthesis  of  these  addition  circuits  in 
TRANSCONS.  They  are: 

•  There is a large and interesting space of alternative implementations. If we cannot
synthesize all of the implementations there is cause to wonder whether TRANSCONS
is general enough.

•  The three implementations of binary addition to be discussed here fit well into VLSI.

•  The synthesis paths shown here demonstrate well how general mathematical knowledge
fits together with TRANSCONS techniques to develop VLSI circuits that cannot
be developed using either alone.


In the remainder of this chapter we present the set of synthesis techniques necessary to
create implementations of the three solutions to the problem of adding numbers represented
as bit vectors.

6.1  Notation 

In what follows, we will assume that a problem instance resides in vectors $A$ and $B$,
each containing individual "bits" $a_i$ resp. $b_i$ for $0 \le i \le n-1$. The two states of a bit
are represented by the values 0 and 1. (This discussion is specialized to binary integers,
but any radix can be used by reinterpreting the logical operators in an obvious manner.)
We apply logical operators to the values 0 and 1, interpreting 0 as false and 1 as true.
The vector $A$ (resp. $B$) represents $\sum_{0 \le i \le n-1} a_i 2^i$ (resp. $\sum_{0 \le i \le n-1} b_i 2^i$). The answer is similarly
represented in $C$. We will have occasion to refer to $\mathit{carry}_i$, the carry coming into position
$i$. We use $\oplus$ as the symbol for "exclusive OR". We will use $n$ only as the size of the vectors
to be added throughout this chapter.

Our starting point for all of the syntheses in this chapter will be the following specification,
which produces a vector $C$ given vectors $A$ and $B$ as above:

$$\forall\, 0 \le i \le n-1:\quad c_i = a_i \oplus b_i \oplus (\exists j < i)\,[\,a_j \land b_j \land (\forall\, j < k < i)[a_k \lor b_k]\,]$$

Figure 6.1: The "Standard" Specification of Binary Addition

We will be deriving the following "grade school" specification for binary addition (so called
because it corresponds closely to the algorithm taught to grade school pupils for decimal
addition) from the standard specification. A derivation of the standard specification from
the grade school specification is possible by the methods of [Sto77], but will not be given
here.

$$\begin{aligned}
\mathit{carry}_0 &= 0\\
\forall\, 0 \le i \le n-1:\quad c_i &= a_i \oplus b_i \oplus \mathit{carry}_i\\
\mathit{carry}_{i+1} &= (\mathit{carry}_i \land (a_i \lor b_i)) \lor (a_i \land b_i)
\end{aligned}$$

Figure  6.2:  “Grade  School”  Specification  for  Binary  Addition 
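
The two specifications can be compared directly by executing them. The Python sketch below is a brute-force check, not part of the synthesis: bit vectors are modelled as lists with the least significant bit first, and all function names are hypothetical. `add_standard` follows the quantified form of Figure 6.1 and `add_grade_school` follows the recurrence of Figure 6.2.

```python
from itertools import product


def to_bits(x, n):
    """Little-endian list of the low n bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def add_standard(a, b):
    """c_i = a_i xor b_i xor (exists j<i)[a_j and b_j and (forall j<k<i)[a_k or b_k]]."""
    result = []
    for i in range(len(a)):
        carry = any(
            a[j] and b[j] and all(a[k] or b[k] for k in range(j + 1, i))
            for j in range(i)
        )
        result.append(a[i] ^ b[i] ^ int(carry))
    return result


def add_grade_school(a, b):
    """carry_0 = 0; carry_{i+1} = (carry_i and (a_i or b_i)) or (a_i and b_i)."""
    result, carry = [], 0
    for i in range(len(a)):
        result.append(a[i] ^ b[i] ^ carry)
        carry = (carry & (a[i] | b[i])) | (a[i] & b[i])
    return result


def specs_agree(n):
    """Check both specifications against integer addition modulo 2**n."""
    for x, y in product(range(2 ** n), repeat=2):
        expected = to_bits((x + y) % 2 ** n, n)
        a, b = to_bits(x, n), to_bits(y, n)
        if add_standard(a, b) != expected or add_grade_school(a, b) != expected:
            return False
    return True
```

The quantified form evaluates a nested loop for every position, which is the quadratic cost that the derivation in the next section sets out to remove.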
6.2  Carry  Look-ahead  Circuit 

Consider the standard specification. If we try to use the methods of TRANSCONS that
have been described so far to synthesize a carry-look-ahead circuit for addition, we get a
circuit with $O(n^2)$ computing elements. The reason for this is the nesting of quantifiers
such that the bound variable of the outer quantifier is one end of the range of the inner
one. This fact forces the computation of one boolean value, namely $(\forall\, j < k < i)[a_k \lor b_k]$,
for each $0 \le j < i \le n-1$ (a total of $n(n-1)/2$ $(i,j)$ pairs).

We  would  like  to  do  better. 

6.2.1  Quantifier  Levelling 

The problem with the standard specification is that it has a pair of nested quantifiers,
with the range of the inner quantifier bounded by the bound variable of the outer, used as
a predicate. Specifically, we have $c_i = a_i \oplus b_i \oplus (\exists j < i)[\,a_j \land b_j \land (\forall\, j < k < i)[a_k \lor b_k]\,]$.

Evaluating this predicate is expensive. Even reusing values, i.e., using $(\forall\, j+1 < k < i)[a_k \lor b_k]$
to compute $(\forall\, j < k < i)[a_k \lor b_k]$, $\Theta(n^2)$ computing elements are required to evaluate the
form. However, it is possible to proceed in a series of steps to a form that can be evaluated
using only $\Theta(n)$ elements.

First we use the following identity

$$(\forall\, l < x < u)[P(x)] \;\equiv\; \max_{x<u}[\lnot P(x)] \le l \qquad (\forall\text{-to-max})$$

which gives us a tool to express a doubly bounded quantifier as an inequality applied to
a max operator. Here $\max_{x<u}[Q(x)]$ denotes the largest $x < u$ for which $Q(x)$ holds, and
$-\infty$ if there is none. The benefit of doing this is that it reduces the multiplicity of values to
be computed. Rather than have a $\Theta(n)$ set of booleans to compute for each $j$, we have a
single $\log n$ bit integer to compute and compare with $j$.
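
The identity can be checked exhaustively on small domains. The Python sketch below assumes, for illustration, that x ranges over the non-negative integers below `size`, that a predicate is given as a list of booleans, and that the max of an empty set is negative infinity; the function names are hypothetical.

```python
from itertools import product

NEG_INF = float("-inf")


def max_where(pred, u):
    """Largest x < u with pred[x] true, or -infinity if there is none."""
    return max((x for x in range(u) if pred[x]), default=NEG_INF)


def forall_to_max_holds(pred, l, u):
    """Compare (forall l < x < u)[P(x)] with max_{x<u}[not P(x)] <= l."""
    lhs = all(pred[x] for x in range(max(l + 1, 0), u))
    rhs = max_where([not p for p in pred], u) <= l
    return lhs == rhs


def check_identity(size):
    """Try every predicate on range(size) and every pair of bounds."""
    return all(
        forall_to_max_holds(list(pred), l, u)
        for pred in product([False, True], repeat=size)
        for u in range(size + 1)
        for l in range(-1, size)
    )
```

The check covers the degenerate cases too: when the range between l and u is empty the quantifier is vacuously true, and every x below u is then at most l, so the inequality holds as well.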

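This doesn't solve the problem. We are left with

$$c_i = a_i \oplus b_i \oplus (\exists j < i)\,[\,a_j \land b_j \land \underline{\max_{k<i}[\lnot(a_k \lor b_k)] < j}\,]$$

where the substituted expression is underscored. (The identity yields $\le j$; the inequality
can be written as strict because position $j$ satisfies $a_j \land b_j$ and therefore cannot satisfy
$\lnot(a_j \lor b_j)$.)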

We are still faced with the problem of computing the max operator for each of the $i$'s and
using a quantifier as a boolean to establish a carry bit for each $i$. We are therefore going
to take advantage of another identity to turn a singly bounded quantifier into a doubly
bounded one:

$$(\exists x < u)[\,P(x) \land F(u) < x\,] \;\equiv\; (\exists\, F(u) < x < u)[P(x)] \qquad (\text{constraint-to-binder})$$

providing the following:

$$c_i = a_i \oplus b_i \oplus (\underline{\exists\, \max_{k<i}[\lnot(a_k \lor b_k)] < j < i})[\,a_j \land b_j\,]$$

where again the new form is underscored.

This does not completely solve the problem, because it would still require a max operation
for each $i$, but it allows us to apply one last identity,

$$(\exists\, l < x < u)[P(x)] \;\equiv\; \max_{x<u}[P(x)] > l \qquad (\exists\text{-to-max})$$

which gives us

$$c_i = a_i \oplus b_i \oplus \bigl(\max_{j<i}[a_j \land b_j] > \max_{k<i}[\lnot(a_k \lor b_k)]\bigr)$$


Now we have an inequality involving the fruits of two max operations. While each of these maxes must be computed for each $i$, this is a form that can be treated by the methods of Section 5.2. It is therefore only necessary to build two tree structures in which the computing elements contain $\Theta(\log n)$ logic gates.
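The chain of rewrites can be sanity-checked by comparing the quantified carry predicate with the final max inequality over all short bit vectors. The sketch below is a hypothetical test harness in Python, not TRANSCONS output. It assumes that index 0 is the least significant bit, that the inner quantifier ranges over $j \le k < i$, and that an empty max is $-\infty$; all function names are ours.

```python
import itertools
import math


def last_index(flags, upper):
    """Largest x < upper with flags[x] true, or -inf if there is none."""
    hits = [x for x in range(upper) if flags[x]]
    return max(hits) if hits else -math.inf


def carry_quantified(a, b, i):
    """(exists j < i)[a_j and b_j and (forall j <= k < i)[a_k or b_k]]."""
    return any(
        a[j] and b[j] and all(a[k] or b[k] for k in range(j, i))
        for j in range(i)
    )


def carry_by_max(a, b, i):
    """max_{j<i}[a_j and b_j] > max_{k<i}[not (a_k or b_k)]."""
    both = [x and y for x, y in zip(a, b)]
    neither = [not (x or y) for x, y in zip(a, b)]
    return last_index(both, i) > last_index(neither, i)


def forms_agree(n):
    """Check both forms on every pair of n-bit operands and every position."""
    for a in itertools.product([0, 1], repeat=n):
        for b in itertools.product([0, 1], repeat=n):
            for i in range(n + 1):
                if carry_quantified(a, b, i) != carry_by_max(a, b, i):
                    return False
    return True
```

When neither max exists, both sides are $-\infty$ and the strict inequality is false, which matches the absence of a carry into bit 0.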


The specification is now

$$\forall\, 0 \le i \le n-1: \quad c_i = a_i \oplus b_i \oplus \bigl(\max_{j<i}[a_j \land b_j] > \max_{k<i}[\lnot(a_k \lor b_k)]\bigr)$$


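It is possible to express this as an inequality between corresponding elements of the results of two parallel prefix computations as follows:

$$\begin{aligned}
&\forall\, 0 \le i \le n-1 \\
&\quad \mathit{and}_i = a_i \land b_i \\
&\quad \mathit{nor}_i = \lnot(a_i \lor b_i) \\
&\quad \mathit{max1and}_i = \textbf{if } \mathit{and}_i \textbf{ then } i \textbf{ else } -\infty \\
&\quad \mathit{max1nor}_i = \textbf{if } \mathit{nor}_i \textbf{ then } i \textbf{ else } -\infty \\
&\quad \mathit{maxand}_i = \max_{0 \le j \le i}[\mathit{max1and}_j] \\
&\quad \mathit{maxnor}_i = \max_{0 \le j \le i}[\mathit{max1nor}_j] \\
&\quad c_i = a_i \oplus b_i \oplus (\mathit{maxand}_{i-1} > \mathit{maxnor}_{i-1})
\end{aligned}$$

Here $\mathit{maxand}_{-1}$ and $\mathit{maxnor}_{-1}$ are taken to be $-\infty$, so that no carry enters bit 0.

There are two parallel prefix trees in the addition parallel structure: one for the variable named maxand and another for maxnor. The overall structure is shown below.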


Figure  6.3:  Synthesized  Look-Ahead  Circuit  for  Binary  Addition 
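The two-prefix formulation can be exercised in software before any circuit is drawn. The following Python sketch is only an illustration under stated assumptions: operands are little-endian lists of 0/1 bits, the prefix maxima are inclusive, and `itertools.accumulate` stands in sequentially for what the two prefix trees would compute in parallel. The function names are hypothetical.

```python
import math
from itertools import accumulate


def add_via_prefix_max(a, b):
    """Sum bits c[0..n-1] of two little-endian bit lists of equal length."""
    n = len(a)
    # max1and[i] is i where both bits are set; max1nor[i] is i where neither is.
    max1and = [i if (a[i] and b[i]) else -math.inf for i in range(n)]
    max1nor = [i if not (a[i] or b[i]) else -math.inf for i in range(n)]
    # Inclusive prefix maxima, one list per prefix tree.
    maxand = list(accumulate(max1and, max))
    maxnor = list(accumulate(max1nor, max))
    c = []
    for i in range(n):
        # The carry into bit i compares the prefixes that end at position i - 1.
        carry_in = i > 0 and maxand[i - 1] > maxnor[i - 1]
        c.append(a[i] ^ b[i] ^ int(carry_in))
    return c


def to_bits(x, n):
    """Little-endian list of the low n bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(bit << i for i, bit in enumerate(bits))
```

For example, `from_bits(add_via_prefix_max(to_bits(5, 4), to_bits(3, 4)))` evaluates the 4-bit sum of 5 and 3. Each prefix value is an index, so the data paths in this form carry roughly $\log n$ bits.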


There are two important differences between this structure and the standard one of [Hwarg].

•  Because of the nature of the parallel prefix network synthesized by TRANSCONS, each node is partially responsible for the choreography in its local region. The importance of this fact is that either the nodes need be big enough to participate in an asynchronous data transfer protocol with a handshake, or a global clock must be provided.

This  is  not  a  serious  problem  because  the  methods  of  Section  5.3  enable  us  to  remove 
dependence  on  local  handshaking,  reducing  the  computation  nodes  to  combinatorial  logic. 

•  Because the parallel prefix trees are required to handle integers in the interval $[0, n]$, the size of the nodes and the width of the data paths within the trees are $\Theta(\log n)$. In the standard network it would be $\Theta(1)$.

This  disadvantage  can  be  alleviated  by  some  careful  reasoning,  to  be  described  below. 


6.2.2  Data  Path  Width  Reduction 

To reduce the width of the data paths and still use a parallel prefix network, an associative operation with constant range and domain must be used. We see that this might be possible because either $\max_{0 \le j \le i}[\mathit{max1and}_j] = \max_{0 \le j \le i+1}[\mathit{max1and}_j]$ or $\max_{0 \le j \le i+1}[\mathit{max1and}_j] = i+1$, and similarly for $\mathit{max1nor}$. We have four cases from the four possible values of $a_{i+1}$ and $b_{i+1}$, or similarly from $\mathit{and}_{i+1}$ and $\mathit{nor}_{i+1}$. For brevity we will give the name $P_i$ to $\mathit{maxand}_i > \mathit{maxnor}_i$. We can show that $P_{i+1}$ depends only on $P_i$, $\mathit{and}_{i+1}$ and $\mathit{nor}_{i+1}$ from the fact that if neither $\mathit{and}_{i+1}$ nor $\mathit{nor}_{i+1}$ we have $P_{i+1} = P_i$, if $\mathit{and}_{i+1}$ we have $P_{i+1}$, and if $\mathit{nor}_{i+1}$ we have $\lnot P_{i+1}$. This information can be summarized in the following table:


(*this is impossible but knowledge of this fact is unnecessary for the argument)

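The effect of $\mathit{and}_{i+1}$, $\mathit{nor}_{i+1}$, $\mathit{and}_{i+2}$ and $\mathit{nor}_{i+2}$ on the truth of $\mathit{maxand}_{i+2} > \mathit{maxnor}_{i+2}$ given $\mathit{maxand}_i > \mathit{maxnor}_i$ can also be summarized as below. (Here the impossible combinations have been omitted for brevity.)

                                     maxand_i > maxnor_i =>
    column i+1      column i+2       true        false

    neither         neither          true        false
    and_{i+1}       neither          true        true
    nor_{i+1}       neither          false       false
    neither         and_{i+2}        true        true
    and_{i+1}       and_{i+2}        true        true
    nor_{i+1}       and_{i+2}        true        true
    neither         nor_{i+2}        false       false
    and_{i+1}       nor_{i+2}        false       false
    nor_{i+1}       nor_{i+2}        false       false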

A similar (although lengthy) table can be made encompassing three and/nor value pairs.

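We see from the two pair table that any two input bit pairs act as an operator that can do one of three things: it can act like an $a_i = b_i = \text{true}$ bit pair (called $\langle\mathit{and}\rangle$ below), like an $a_i = b_i = \text{false}$ bit pair (called $\langle\mathit{nor}\rangle$), or like an $a_i \ne b_i$ bit pair (called $\langle\mathit{other}\rangle$). From the above table it can be seen that the binary operator $\odot = \lambda x, y\,[\textbf{if } y = \langle\mathit{and}\rangle \textbf{ then } \langle\mathit{and}\rangle \textbf{ elseif } y = \langle\mathit{nor}\rangle \textbf{ then } \langle\mathit{nor}\rangle \textbf{ else } x]$ describes the result of combining two adjacent columns. A case analysis on all possible triples shows that this operator is associative. Using $F(a_i, b_i)$ as an abbreviation for $\textbf{if } a_i \land b_i \textbf{ then } \langle\mathit{and}\rangle \textbf{ elseif } \lnot(a_i \lor b_i) \textbf{ then } \langle\mathit{nor}\rangle \textbf{ else } \langle\mathit{other}\rangle$, it is therefore possible to write $c_i \leftarrow a_i \oplus b_i \oplus (\odot_{0 \le j < i}[F(a_j, b_j)] = \langle\mathit{and}\rangle)$. The identity of this operator is $\langle\mathit{other}\rangle$, and $\mathit{maxand}_i > \mathit{maxnor}_i \equiv (\odot_{0 \le j \le i}[F(a_j, b_j)] = \langle\mathit{and}\rangle)$. This is precisely what was needed to perform a parallel prefix summation with constant-width data paths: an associative operator with finite range and domain.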
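The three-valued operator is small enough to check mechanically. The Python sketch below is an illustration under assumptions of ours, not the TRANSCONS representation: the names `Col`, `combine`, `classify` and `add_via_scan` are hypothetical, operands are little-endian bit lists, and `itertools.accumulate` plays the role of the single prefix tree.

```python
from enum import Enum
from itertools import accumulate, product


class Col(Enum):
    AND = "and"      # a_i = b_i = 1: a carry is generated
    NOR = "nor"      # a_i = b_i = 0: any carry is killed
    OTHER = "other"  # a_i != b_i: the incoming carry passes through


def combine(x, y):
    """Effect of column x followed by the more significant column y."""
    return x if y is Col.OTHER else y


def classify(a_bit, b_bit):
    """Map one input bit pair to its column kind."""
    if a_bit and b_bit:
        return Col.AND
    if not (a_bit or b_bit):
        return Col.NOR
    return Col.OTHER


def is_associative():
    """Case analysis on all 27 triples."""
    return all(
        combine(combine(x, y), z) == combine(x, combine(y, z))
        for x, y, z in product(Col, repeat=3)
    )


def add_via_scan(a, b):
    """Sum bits of two little-endian bit lists using one inclusive scan."""
    cols = [classify(x, y) for x, y in zip(a, b)]
    prefix = list(accumulate(cols, combine))
    c = []
    for i in range(len(cols)):
        # The carry into bit i is set when the columns below i reduce to AND.
        carry_in = i > 0 and prefix[i - 1] is Col.AND
        c.append(a[i] ^ b[i] ^ int(carry_in))
    return c
```

Every value passed between scan steps is one of three symbols, so two bits per data path suffice regardless of the operand width.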

Use of a specification based on this operator will yield a network similar to that of Figure 6.3, except that there will only be a single parallel prefix tree, each bit's carry will be used directly rather than computed from the two parallel prefix trees, and (of course) the widths of the data paths and the size of the nodes will be smaller.

6.3  Ripple-carry  and  Bit  Serial  Circuits 

Consider our “standard specification” of Figure 6.1. If we apply the quantifier levelling of subsection 6.2.1, we get:

$$\forall\, 0 \le i \le n-1: \quad c_i = a_i \oplus b_i \oplus \bigl(\max_{j<i}[a_j \land b_j] > \max_{k<i}[\lnot(a_k \lor b_k)]\bigr)$$


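If we change this to

$$\begin{aligned}
&\mathit{carry}_{-1} \leftarrow \text{false} \\
&\forall\, 0 \le i \le n-1: \quad \mathit{carry}_i \leftarrow \max_{j \le i}[a_j \land b_j] > \max_{k \le i}[\lnot(a_k \lor b_k)] \\
&\forall\, 0 \le i \le n-1: \quad c_i = a_i \oplus b_i \oplus \mathit{carry}_{i-1}
\end{aligned}$$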


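we can repeat the reasoning of the first part of the last section, obtaining the recurrence relation. This gives us

$$\begin{aligned}
&\mathit{carry}_{-1} \leftarrow \text{false} \\
&\forall\, 0 \le i \le n-1: \quad \mathit{carry}_i \leftarrow (a_i \land b_i) \lor (\mathit{carry}_{i-1} \land (a_i \lor b_i)) \\
&\forall\, 0 \le i \le n-1: \quad c_i = a_i \oplus b_i \oplus \mathit{carry}_{i-1}
\end{aligned}$$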


This  is  logically  identical  to  the  gradeschool  specification  of  Figure  6.2.  More  importantly, 
it  can  be  transformed  into 


PC Istype PROCESSORS (i), $0 \le i \le n-1$
    HAS $c_i$
    HEARS PA (USES $a_i$)
    HEARS PB (USES $b_i$)
    HEARS $Pcy_{i-1}$ (USES $\mathit{carry}_{i-1}$)
    TALKS . . .

Pcy Istype PROCESSORS (i), $-1 \le i \le n-1$
    HAS $\mathit{carry}_i$
    if $i \ge 0$ then HEARS PA (USES $a_i$)
    if $i \ge 0$ then HEARS PB (USES $b_i$)
    if $i \ge 0$ then HEARS $Pcy_{i-1}$ (USES $\mathit{carry}_{i-1}$)
    if $i < n-1$ then TALKS $Pcy_{i+1}$ (SENDS $\mathit{carry}_i$)

(in $Pcy_{-1}$) :
    $\mathit{carry}_{-1} \leftarrow \text{false}$
(in $Pcy_i$, $0 \le i \le n-1$) :
    $\mathit{carry}_i \leftarrow (a_i \land b_i) \lor (\mathit{carry}_{i-1} \land (a_i \lor b_i))$
(in $PC_i$, $0 \le i \le n-1$) :
    $c_i \leftarrow a_i \oplus b_i \oplus \mathit{carry}_{i-1}$




using  the  methods  of  crystalline  synthesis  described  earlier  in  this  thesis.  A  diagram  of 
the  resulting  parallel  structure  (after  explicating  all  values  internal  to  the  computations) 
is  shown  below. 


Figure  6.4:  Ripple  Carry  Parallel  Structure 


The  aggregation  of  Section  3.3  is  then  applicable.  This  technique  replaces  a  related  series 
of  processing  elements  by  a  single  element  that  receives  a  series  of  related  data.  The  circuit 
of  Figure  6.4  is  an  indexed  series  of  identical  modules,  and  identifying  corresponding  nodes 
of  the  series  of  modules  gives  the  bit  serial  addition  circuit  shown  below. 


Figure 6.5: Serial Adder
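The relationship between the ripple-carry structure and its aggregated, bit-serial form can be illustrated in a few lines. This Python sketch is an assumption-laden model rather than the synthesized circuit: `ripple_carry` unrolls one carry cell per bit position, while the hypothetical `SerialAdder` class keeps a single cell whose `carry` attribute plays the role of the latch and is fed one bit pair per clock tick, least significant first.

```python
def ripple_carry(a, b):
    """One carry cell per bit position; a and b are little-endian bit lists."""
    carry = 0  # plays the role of carry_{-1} = false
    c = []
    for a_i, b_i in zip(a, b):
        c.append(a_i ^ b_i ^ carry)
        # carry_i = (a_i and b_i) or (carry_{i-1} and (a_i or b_i))
        carry = (a_i & b_i) | (carry & (a_i | b_i))
    return c


class SerialAdder:
    """A single aggregated cell with one latch for the carry."""

    def __init__(self):
        self.carry = 0

    def step(self, a_i, b_i):
        """Consume one bit pair and emit one sum bit."""
        out = a_i ^ b_i ^ self.carry
        self.carry = (a_i & b_i) | (self.carry & (a_i | b_i))
        return out
```

Both forms evaluate the same recurrence; aggregation trades the $n$ cells of the ripple-carry structure for $n$ time steps on one cell.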


We have seen that quantifier levelling, recurrence relation analysis, crystalline synthesis and aggregation together are sufficient to synthesize the three types of binary addition circuit in common use. It remains to prove the identities used in quantifier levelling. We do this in Appendix Section C.

We have seen that the use of general mathematical identities considerably increases the power of TRANSCONS to synthesize VLSI circuit topologies.


Chapter  7 


Conclusions  and  Summary 


7.1  Overview 


TRANSCONS is a collection of tools and methods for creating parallel structures from first order logic specifications. This collection of capabilities interacts in various ways to act as a VLSI assistant. We have described the three major divisions of these abilities in the body of the thesis and will summarize them below.

We have also described, and will summarize, the theoretical framework in which a VLSI synthesis system must work. This includes models underlying the various structures that can be created, the theory needed to justify the application of some of the transformational rules of TRANSCONS, and the underlying assumptions that justify the incorporated heuristics.

In this chapter we summarize this thesis by stepping back and describing its results, its main points, suggested future lines of research, and its meaning.


7.2  Essential  Points 


We  claim  that  we  have  devised  theory  for  the  following  forms  of  transformational  synthesis 
of  concurrent  structures: 

•  creation  of  lattice-like  arrays  of  processors  from  specifications  in  a  very  high  level 
language  resembling  first  order  logic; 

•  modifications of intermediate forms synthesized while creating lattice-like arrays by several techniques called aggregation, virtualization, communication reduction, and chain creation, which turn lattices that would be impractical to build into lattices that would be practical;

•  creation of tree structures of processors from FOL specifications using a variant of D&C in which only certain ways of combining problems' partial solutions are permitted;

•  use  of  closures,  or  communicable  functional  objects,  to  facilitate  reasoning  about 
bidirectional  communication.  Instead  of  reasoning  about  a  message  from  A  to  B, 
reason  about  a  closure  sent  from  B  to  A.  This  enables  a  later  transmission  from  A 
to  B  whose  effect  has  already  been  studied  by  the  closure  synthesis  process; 

•  combinations  of  any  of  the  above,  using  user  supplied  aggregations. 

We  further  claim  that  this  is  done  in  a  manner  so  as  to  facilitate  use  of  expected 
future  advances  in  theorem  proving  technology. 

We  further  claim  that,  using  these  techniques,  a  system  can  be  built  to  do  all  of  the 
following  on  a  practical  basis: 

•  synthesize regular interconnections of regular arrays of processors, together with programs for the processors, to perform interesting and useful computations which are specified in a very high level language similar to first order logic;

•  synthesize systolic arrays;

•  synthesize tree interconnections of processors;

•  synthesize lower level specifications from any of the above, in some cases down to something suitable for direct VLSI implementation after a placement and routing algorithm (not discussed here) has been run.


TRANSCONS will be an effective tool for allowing integrated circuit designers to cope with the million gates that will shortly be available on a single chip in a manner that does not merely provide larger versions of circuits already available. Automatic programming tools have the ability to bring problem decomposition and program combination knowledge to bear on the task of synthesizing programs too large and complex to write by hand. In an analogous manner, TRANSCONS's ability to bring processor interconnection and problem decomposition knowledge to bear on the task of synthesizing VLSI will make possible the creation of accurate integrated circuits too complex to create by hand.


7.3  Foundations 


There  are  several  fundamental  points  upon  which  this  work  rests. 


In order to rationally discuss the synthesis of concurrent systems, we must have at least one computation model in mind. TRANSCONS has a series of four models, each more restrictive, more descriptive, and able to use simpler processing elements than the previous one.

There are transformation rules dealing with processor assignment. We initially make very simple assumptions of how things will be organized in a parallel structure, and refine these with other transformation rules.

The models, processor assignments, and refinements are three fundamental parts of TRANSCONS that we have described in previous chapters and will summarize below.


7.3.1  Models 


We  distinguish  four  types  of  processing  elements  for  which  it  might  be  desirable  to  write 
specifications.  These  models  of  processing  elements  occupy  positions  along  a  continuous 
spectrum  of  minimum  complexity  to  implement.  They  range  from  models  that  require 
a general purpose processor capable of rapid “context switching” among several loosely
related  jobs  and  having  sufficient  memory  to  keep  track  of  each  of  the  jobs  and  associated 
data,  down  to  one  whose  minimum  processing  element  would  be  a  latch  connected  to  a 
minor  piece  of  combinatorial  logic  such  as  an  AND  gate. 


The  first  part  of  TRANSCONS  transforms  a  supplied  specification  from  its  FOL  form  into 
a  parallel  structure  that  meets  the  requirement  of  the  first  model  only.  Following  this,  a 
series  of  transformations  can  proceed  from  model  to  model  until  the  lowest  level  in  which 
the  specification  can  be  solved  is  reached. 


We  provide  the  multiple  models  because  it  is  not  always  desirable  or  even  possible  to 
transform  a  specification  to  the  lowest  level.  Each  level  corresponds  to  an  implementation 
decision  that  can  be  correct  under  certain  assumptions  of  component  availability,  future 
transformation  of  the  output  by  other  systems,  and  finality  of  the  design. 


7.3.2  Processor  Assignment 


The fundamental unit of responsibility in parallel structures that TRANSCONS synthesizes
is  the  array  element.  If  two  processors  cooperate  in  any  way  to  produce  part  of  an  answer, 
that  cooperation  must  be  expressed  as  use  of  values  computed  in  one  processor  by  another. 
Such  uses  of  communicated  values  constitute  the  fundamental  unit  of  computation. 


When  the  specification  is  delivered  to  TRANSCONS,  it  may  not  be  possible  to  isolate 
a  natural  array  around  which  processors  can  be  multiplied.  In  such  a  case,  whenever 
there  is  enough  work  to  make  it  profitable  to  split  the  problem  among  a  large  number  of 
processors,  there  must  be  a  reduction  operation  within  the  specification  that  generates  a 
series  of  intermediate  values.  The  technique  of  virtualization,  or  explicating  the  multiple 
assignments  to  this  variable  into  assignments  to  different  elements  of  a  corresponding  array, 
can  be  applied. 


If the result of applying TRANSCONS to a specification produces a parallel structure that has more processors than desirable, the technique of aggregation can be applied to group together processors that perform similar operations on different data at different times, and arrange for all of the computations in each group to take place in a single processor. There is an aggregation that reverses any virtualization, but when one aggregation can be performed several can be performed (on any but a one-dimensional structure). We have found the technique of performing a virtualization, and then an aggregation that merges diagonal groups of processors (picturing the array of processors geometrically) to be an effective technique for producing one- and two-dimensional systolic arrays.

7.3.3 Connectivity Restructuring

After responsibility for computing array elements is allocated among processors, there
is  an  immediately  suggested  network  of  interconnections  among  the  processors,  i.e.,  the 
particular  network  that  has  a  wire  running  directly  from  the  processor  computing  each 
datum  to  each  of  those  processors  needing  the  datum.  If  this  structure  is  satisfactory,  the 
synthesis  process  is  complete,  but  in  none  of  the  problems  we  examined  has  this  obvious 
network  been  satisfactory. 

Two solutions to this problem are reduction of snowballing induced sets, and chain formation.

Snowball reduction involves the discovery that there is a series of processors, each of which requires (in part) some of the set of values required by its predecessor, if any, plus that computed by that predecessor itself. If this and several other minor conditions are met, the set of connections in which each processor is connected to all previous processors in the series can be replaced by the set of connections in which each processor is connected only to its predecessor.
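The saving from snowball reduction can be seen by counting wires in a toy model. The Python sketch below makes simplifying assumptions of ours: processors are numbered 0 to n-1, every processor forwards everything it has heard together with its own value, and the "several other minor conditions" mentioned above are ignored. The function names are hypothetical.

```python
def direct_wires(n):
    """Wires when processor i hears every earlier processor j < i."""
    return [(j, i) for i in range(n) for j in range(i)]


def chain_wires(n):
    """Wires after the reduction: processor i hears only processor i - 1."""
    return [(i - 1, i) for i in range(1, n)]


def values_available(n, wires):
    """Values each processor can see if every processor forwards what it
    has heard together with the value it computes itself."""
    seen = [{i} for i in range(n)]
    # Visit wires in order of destination so each source is complete first.
    for src, dst in sorted(wires, key=lambda wire: wire[1]):
        seen[dst] |= seen[src]
    return seen
```

In this model both wirings deliver the same sets of values, while the wire count falls from $n(n-1)/2$ to $n-1$.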

When a large number of processors each need to be connected to an I/O processor, we have an opportunity for chain formation. If it can be established that the set of values required from this I/O processor falls into groups such that each value in each of the groups is required by a distinguished group of processors, then it is possible to first form those groups of processors into a “bucket brigade” chain and then to introduce the values required by the processors in the chain at one end only.

Another  opportunity  for  chain  formation  arises  when  the  acceptable  time  for  delivering  the 
distributed  result  of  a  computation  to  the  I/O  processor  is  sufficiently  large  to  allow  the 
processors  to  form  a  bucket  brigade  to  deliver  these  results  rather  than  having  each  do  it 
with  its  own  connection.  In  this  case  the  collection  of  processors  into  groups  is  arbitrary. 


7.3.4 Divide & Conquer, and Closures


Many  specifications  are  amenable  to  solution  by  parallel  structures  in  which  the  processors 
are  connected  in  a  binary  tree.  We  use  D&C  to  find  tree  solutions  where  they  exist.  The 
temptation  to  do  this,  of  course,  is  the  commonality  of  form  between  the  nodes  of  a  call 
graph  that  would  result  from  execution  of  a  program  that  fit  the  D&C  scheme  and  the  tree 
structure. 

We  have  isolated  several  problems  that  frequently  arise  when  we  try  to  perform  such  a 
synthesis  in  an  obvious  manner.  In  some  cases  a  simple  D&C  solution  would  require 
solution  of  half  of  the  problem  before  solution  to  the  other  half  can  be  attempted.  In  other 
related  cases  each  half  of  the  divided  problem  can  be  solved  without  reference  to  the  other 
but  combination  of  the  two  halves’  solutions  requires  significant  work.  In  yet  other  cases 
the  combination  requires  the  handling  of  significant  data  in  processors  close  to  the  root.  All 
of  these  problems  can  be  met  by  changing  our  view  of  the  problem.  For  some  specifications 
several D&C solutions are available, but each has one of these three problems.

Instead  of  considering  our  task  to  be  computing  some  array  from  some  other  array,  we 
consider  our  task  to  be  the  computation  of  a  functional  object  which,  when  applied  to 
a  given  argument  list,  produces  the  desired  effect.  In  the  many  cases  we  explored,  this 
solves  the  group  of  problems  described  in  the  previous  paragraph.  In  addition,  this  solves 
another problem facing the use of D&C to synthesize tree structures, i.e., the fact that real problems often require communication both up and down the tree but D&C seems suited to reasoning about upward communication only. It solves this problem by allowing the
upward  flow  of  a  closure  to  stand  for  the  downward  flow  of  information. 
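A small software analogue may help make the closure idea concrete. The Python sketch below applies it to running sums, which is our choice of example rather than one taken from the thesis. Each subproblem returns its total together with a closure; the closure, when later applied to the sum of everything to its left, stores the finished results. The names `solve` and `prefix_sums` are hypothetical.

```python
def solve(values, lo, hi, out):
    """Return (total, closure) for values[lo:hi].

    The closure, applied to the sum of everything left of position lo,
    stores the running sums for positions lo..hi-1 into out.
    """
    if hi - lo == 1:
        def leaf(offset):
            out[lo] = offset + values[lo]
        return values[lo], leaf

    mid = (lo + hi) // 2
    left_total, left_fill = solve(values, lo, mid, out)
    right_total, right_fill = solve(values, mid, hi, out)

    def node(offset):
        left_fill(offset)                # the left half sees the same offset
        right_fill(offset + left_total)  # the right half also needs the left total
    return left_total + right_total, node


def prefix_sums(values):
    """Running sums computed by divide and conquer with closures."""
    out = [None] * len(values)
    if values:
        _, fill = solve(values, 0, len(values), out)
        fill(0)  # the root supplies the argument: nothing lies to its left
    return out
```

Only totals and closures travel toward the root; the offsets that conceptually flow back down the tree appear solely as arguments in the calls to those closures.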

It is possible to implement a structure including closures using the highest level model of TRANSCONS. We do not wish to leave closures in the finished product, however, because closures' implementation requires extra communication and also requires the structure to be implemented in this highest level model. We therefore provide methods for removing the closures, modifying a parallel structure that contains them into one that contains explicit communication, replacing the information that flows from the closures' recipients to the hosts when they are applied.

The closure can be used to make D&C a more powerful tool for program synthesis, as well as to make it a practical tool for concurrency synthesis. A frequent difficulty in using D&C, even for sequential program synthesis, is the synthesis of the combine operator, i.e., the program to handle the boundary conditions when combining partial solutions. If we augment the specification to require that the solution to each subproblem supply both the desired information and a closure that performs the gluing operation at the boundaries, the synthesis task becomes easier.

7.3.5  Miscellaneous  Techniques 

If  TRANSCONS  is  limited  to  a  few  specific  techniques  it  will  not  be  a  very  useful  tool.  It 
will  synthesize  a  limited  set  of  parallel  structures  from  their  specifications,  but  it  will  not 
extend  well  because  it  will  not  enjoy  the  use  of  general  mathematical  knowledge. 

We  show,  however,  that  general  mathematical  knowledge  can  work  well  with  the 
TRANSCONS  concurrency  knowledge  to  produce  better  parallel  structures  than  would 
otherwise  be  achievable.  In  one  case,  the  synthesis  of  networks  for  solving  a  version  of 
the  Connected  Components  problem,  use  of  set  theory  axioms  and  axioms  describing  the 
transitivity of the “same connected component” relation is vital. In another case we explore, the synthesis of circuits for the addition of binary integers, we show purely algebraic techniques without which the addition of $n$ bit integers requires $O(n^2)$ computing elements. The use of a series of techniques allows this to be successively reduced to $\Theta(n \log n)$ and then to $\Theta(n)$. Further techniques allow the direct synthesis of a slower network with $\Theta(n)$ elements, and the use of aggregation allows this to be reduced to a circuit with a constant amount of logic.


7.4  Future  Work 


There is a lot of effort required before this work can be considered complete. We will describe this work from the most direct extensions of work already performed to that portion most requiring future theoretical findings. The latter part will be described in more detail under separate headings.

First, some of the basic knowledge of TRANSCONS has to be codified. Recent improvements to the CHI system, made to improve the efficiency of knowledge storage and retrieval and to facilitate inclusion of theorem provers, impel modifications to the crystalline synthesis section. The tree rules and data structures exist only in the most rudimentary form because of the newness of the conception of the use of closures for this purpose.

This will yield a TRANSCONS in which the designer has a savant assistant at his disposal, but (s)he must supply assertions and inventions at critical places. The inclusion of backwards inference, or the inference of the form of a function from assertions about its behavior, will make especially the tree synthesis portion of TRANSCONS much more capable of performing on its own.

There are also four more important extensions to the framework that require some fundamental work. When this is done, we claim that the new rules will mesh well with the prototype system.

7.4.1 Routing Problems

Suppose we need an implementation of the APL statement A[I] = B, where all are vectors and where it is known that no two values of I are the same but all are within range. In short, suppose we want to perform a general permutation where each processor knows where to send its value. (The similar case, A = B[I], in which each processor knows where it wants to get its value from, can be handled by asserting that each processor knows where to send a closure to and having the closure accept a value.) This problem is closely related to sorting.
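The two statements, and the reduction of the second to the first by means of closures, can be written out directly. The Python sketch below is a sequential illustration with hypothetical function names; it assumes that the index vector is a permutation of the valid positions, and it says nothing about the routing network that would carry the messages.

```python
def scatter(index, b):
    """A[I] = B: the holder of b[k] knows its destination index[k]."""
    a = [None] * len(b)
    for k, dest in enumerate(index):
        a[dest] = b[k]
    return a


def gather(index, b):
    """A = B[I]: the holder of a[k] knows its source index[k]."""
    return [b[src] for src in index]


def gather_via_closures(index, b):
    """Gather expressed as a scatter of closures that each accept a value."""
    a = [None] * len(b)

    def make_acceptor(k):
        def accept(value):
            a[k] = value
        return accept

    # Each destination k sends an acceptor to its source index[k] ...
    mailbox = scatter(index, [make_acceptor(k) for k in range(len(b))])
    # ... and each source applies the acceptor it received to its own value.
    for src, accept in enumerate(mailbox):
        accept(b[src])
    return a
```

In this form only the scatter pattern needs a routing solution; the gather case reuses it with closures as the payload.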

There are several parallel structures for the permutation/sorting problem. Particularly well-known structures include the flashsort [ReV82], Batcher's Sort [Bat68], and an interesting relatively new one [AKS83]. These are all couched in the language of sorting. There is also a Benes network available; it takes longer to compute Benes network settings than it takes for the other networks to work, but once the settings are computed the permutation is faster [NaS82]. The settings can be represented by a word slightly smaller than twice the size of a processor ID (and they can replace the latter, which can be computed from the former). This can be used when the same permutation will be used repeatedly.

In this situation the data flow information would seem to require a complete interconnection among the processors, in the case where each of the processors holds one of the values to be permuted and is expected to hold one possibly distinct value after the permutation. It requires deeper knowledge than flow analysis to derive a net such as the shuffle interconnection that is less than complete, but that can accomplish a permutation rapidly. Codification of what it is that a person does when he proves that a smaller network can be adequate for the job is a good subject for future investigation.

7.4.2 Average- vs. Worst-case Behavior


To develop a good practical solution to the permutation/sorting problem may require the ability to reason about average- vs. worst-case behavior. For example, there is an $O(\log n)$ time, $O(n)$ processor sorting algorithm with very reasonable constant factors that has only one problem: it fails with small and asymptotically decreasing probability [ReV82]. Another solution with the same asymptotic behavior and which never fails is described in [AKS83], but the constant factors are comparable to Avogadro's constant, making an implementation obviously impractical.


Our synthesis system should reject the latter solution in favor of the former in cases where $O(\log n)$ time and $O(n)$ processor count are required. We must solve questions of what heuristics to use to generate such parallel structures in cases where sufficiently good worst-case behavior cannot be achieved, and when to accept the disparity and report the derived circuit as meeting the designers' requirements.


7.4.3  Efficiency  Estimation  for  Parallel  Structures 


A  third  problem  is  extension  and  integration  of  an  efficiency  expert.  It  has  been  shown 
[Kan79]  that  sequential  programs  that  arise  in  practice  can  be  analysed  and  an  estimate  of 
performance  derived.  This  is  especially  true  where  the  program  is  synthesized  from  higher 
level  specifications  and  the  synthesis  process  is  designed  with  the  needs  of  the  efficiency 
expert  in  mind.  It  is  not  known  whether  the  same  is  true  for  concurrent  specifications, 
with  interactions  among  processors  making  timing  analysis  more  difficult  and  with  more 
performance  criteria  than  sequential  programs,  but  it  seems  likely. 


7.5  Accomplishments 


The main tangible accomplishment of this work is the beginning of TRANSCONS, a VLSI design assistant. Within the system's limitations, a designer can write input/output specifications in first order logic, guide TRANSCONS at some critical places, and produce a top level block diagram of a circuit that will have that input/output behavior. The blocks it uses are, themselves, specifications that can be exposed to the same process.

TRANSCONS has two domains of applicability. The first is the synthesis of crystalline
parallel  structures  that  are,  loosely  speaking,  those  in  which  the  various  functional  blocks 
or  processors  bear  fixed  relationships  to  their  neighbors.  The  second  is  the  synthesis  of 
tree  structures. 


In addition to providing design functions, TRANSCONS provides a framework upon which a structure of further analysis tools can rest. We demonstrate this by exhibiting a synthesis that depends on some forms from set theory, and another that depends on certain theorems concerning quantifiers. These can either be entered by a human as axioms, entered as hypotheses to be proven by an internal theorem prover, or found by the system using the “weakest precondition” work of [Sml83b]. TRANSCONS is designed to grow by accepting new transformation rules, new heuristics as to what rules to select for application, or new theorem proving technology.

Appendix

A TRANSCONS Usage Examples
A.1 Usage Conventions for V

To enter a V program and start working on it, one starts up the CHI system in the manner prescribed for one's machine. Currently only INTERLISP/TOPS-20 and SYMBOLICS 36xx implementations are available.

One can then give directives to the CHI system. Such directives are in the syntax of LISP, except that in the function name position there will usually be the locution “#>” which has two purposes: it commands the underlying LISP system to interpret the input stream according to V syntax rules, and it requests CHI to parse the V program into an abstract syntax tree.

In  what  follows  the  numbers  followed  by  a  dot  are  prompts;  they  are  typed  out  by  the 
computer. 

1.  (#> (the V program) )

V  analyzes  the  program  and  either  accepts  it,  in  which  case  the  whole  program  becomes 
the  current  node,  or  rejects  it,  giving  an  error  message  in  which  the  location  of  the  error 
is  highlighted.  As  with  most  compilers,  the  actual  error  can  be  in  a  different  place  from 
where  the  compiler  thinks  it  is.  Unlike  with  other  compilers,  however,  it  is  possible  to 
get  more  information  from  the  system  concerning  the  location  of  the  error.  Simply  type 
(resume)  (LISP  machine  implementation)  or  (RETURN)  (INTERLISP  implementation) 
to  be  shown  another  error  guess. 
The basic operation on nodes is rule application. Other operations include focus changing
(to  narrow  the  focus  of  attention  from  a  whole  program  to  one  of  its  parts  or  vice  versa) 
and  various  methods  of  printing  out  nodes. 

In addition to the current node there are named nodes. It is possible to make a named
node by supplying

2. (#> RULE programname (the rest of the program))

After doing this, the name programname is attached to the node created by the reader.
A node must be named if it is not to disappear when you make a new current node.

There  are  several  commands  for  manipulating  the  current  node  or  named  nodes,  or  for 
printing  them  out  in  various  formats.  The  HELP  function  of  the  CHI  system  will  help  you 
find them, but we summarize some of the more important ones below:

(PU F)   Prints the current node Using the F format (as a list of properties). Not
         usually necessary except for TRANSCONS implementors.

(PU I)   Prints the node as a V expression (in the same format as it could be read
         in). I stands for Infix notation.

(MCN L)  Interprets a locator L and Makes it the Current Node. L can be either
         the name of a named node or a prototype, a number, or an index into the
         current node. Such an index is either a positive integer if the current node
         is a list, a property name if the current node is a general object, or 0 if the
         current node has a parent.

         If the locator is the name of a named node or a prototype, that object
         is made the current node. If the current node is a list, then a locator of
         i makes the i-th element of that node the current node. If it is a general
         node having a property p, then a locator of p will make the current node
         the value of the p property. Finally, a locator of 0 will select the parent of
         the current node as the current node.

         It is quite possible for any or all of the last three types of selector to be
         applicable.

         PU can be used with a second argument. If this is done, then the object
         found by the locator is printed, not the current node.
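
The locator rules for MCN can be illustrated with a small Python sketch. This is not CHI code: the Node class, the resolve function and the order in which the locator kinds are tried are all assumptions made for illustration (the text only says that several kinds of selector may be applicable at once).

```python
class Node:
    """Hypothetical tree node: a list node has children, a general object has properties."""

    def __init__(self, children=None, props=None):
        self.parent = None
        self.children = list(children or [])
        self.props = dict(props or {})
        # Record the parent link so that locator 0 can climb back up.
        for child in self.children + list(self.props.values()):
            if isinstance(child, Node):
                child.parent = self


def resolve(current, locator, named):
    """Return the node a locator selects, starting from the current node.

    Assumed precedence: named node, then 0 (parent), then list index, then property.
    """
    if isinstance(locator, str) and locator in named:
        return named[locator]
    if locator == 0:
        if current.parent is None:
            raise LookupError("current node has no parent")
        return current.parent
    if isinstance(locator, int) and 1 <= locator <= len(current.children):
        return current.children[locator - 1]  # locators count from 1
    if isinstance(locator, str) and locator in current.props:
        return current.props[locator]
    raise LookupError(f"locator {locator!r} does not apply here")


# Usage: descend by property name, then by index, then climb with 0.
first, second = Node(), Node()
clauses = Node(children=[first, second])
program = Node(props={"HEARS-CLAUSES": clauses})
named_nodes = {"matmul": program}
```

The session later in this appendix uses the same three moves: a property name (HEARS-CLAUSES), a positive integer (1), and 0 to return toward the top level.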

The  basic  operation  for  making  a  named  node  is  (DEFY  (name)  (type)  ...).  This 
text  should  appear  in  a  ZWEI  buffer  or  a  file  and  it  should  be  EVALuated  or  LOADed, 
respectively.  Procedures  for  doing  this  are  found  in  the  LISP  Machine  Manual.  Valid 
(type)s  include  RULEs,  PROGRAMS,  OPERATIONS  and  PROPERTYs.  OPERATIONS 
and  PROPERTYs  are  primarily  the  concern  of  implementers. 

RULEs  are  used  to  define  input/output  specifications  for  TRANSCONS  (or  any  other 
system  implemented  in  CHI).  The  TRANSCONS  system  consists  of  CHI  together  with 
other  files  consisting  of  rules  (and  some  properties  and  operations). 

A rule always has a name. Whenever a node to which a rule named name should be applied
is the current node, type (AR name) to apply the rule. The result of the rule application
will be displayed. (If the attempt at application was incorrect, and the rule didn’t apply
to that node, the system prints “rule name did not apply” instead.)
A.2  Specific Rules for TRANSCONS

The first step in transforming a specification to a parallel structure via TRANSCONS is to
express the specification in V and enter it into the system. The process of entering the
specification in V is briefly discussed here and discussed more fully in [Gre81], [BKPW84].

TRANSCONS operates on specifications whose primary data structures are arrays. In
general a higher level data structure would be selected for expressing the task, so the
higher level structure has to be transformed. The CHI system, in which TRANSCONS is
embedded, has facilities for performing this transformation. See [Kot84].

In  the  following  discussion  we  will  use  the  matrix  multiplication  problem  to  show  the 
various  techniques  and  intermediate  states.  The  initial  specification  is: 

∀i, i ∈ {1,...,n}
    ∀j, j ∈ {1,...,n}
        C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
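
As a point of reference, the specification can be read as the following sequential Python sketch. The function name matmul_spec and the use of nested lists are illustrative assumptions; the two outer loops correspond to the quantifiers over i and j, and the inner sum to the enumeration over k.

```python
def matmul_spec(A, B):
    """Sequential reading of the specification: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):          # for all i in {1,...,n} (0-based here)
        for j in range(n):      # for all j in {1,...,n}
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C
```

Each of the n × n assignments is independent of the others, which is what makes the array C a candidate for expansion into one processor per element.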

To describe the bounds of arrays and of the various structures they are transformed into,
TRANSCONS has an enumeration object that is slightly less general than that of V to
allow for easy handling by a theorem prover. This is described below.

A.2.1  Multiple  Objects 

The  basic  stuff  of  TRANSCONS’s  parallel  structures  is  the  multiple  object.  Examples  of 
these  include  array  elements,  processors,  and  clumps  of  processors  referred  to  by  HEARS 
clauses. 

We  use  bound  variable  lists  and  enumerators  to  control  these  multiple  objects.  We  do 
not  supply  complete  generality,  because  this  would  lead  to  certain  problems  in  proving 
theorems  about  the  domains,  ranges,  intersections,  unions  and  subset  properties  of  the 
sets enumerated by these constructs. An example of an inadmissible form is


A istype ARRAY (b × c), b ∈ S, c ∈ T


The multiplication is inadmissible because admitting it would create problems where
questions about induced sets would be undecidable. We limit the sets to those definable in
Presburger Arithmetic without multiplication, although we allow the bound variable of
an outer quantifier to be considered a constant in inner quantifiers. This turns out to be
slightly too restrictive because it is impossible to specify a tree finitely in this language, so
we also include special constructs designed solely to specify trees.


Valid bound-variable-list/enumerator pairs for object XXX are of the form

    XXX_(expr) , (enumeration)

where an (expr) is linear in all of the bound variables of the enumerations and an
(enumeration) is of the form (bound-variable) ∈ (set), where a (set) is in turn a set
definable in Presburger arithmetic. This means that integer subranges and equality
modulo a constant are the basic building blocks, and that definite union and
intersection are allowed.
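
The building blocks named above (integer subranges, equality modulo a constant, union and intersection) can be sketched in Python as follows. The class names Subrange, Congruent, Union and Intersection and the helper members are hypothetical and are not TRANSCONS's enumeration object; the sketch only shows that every admissible set is a finite combination of these four forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subrange:
    """The integers lo..hi inclusive."""
    lo: int
    hi: int

    def contains(self, x: int) -> bool:
        return self.lo <= x <= self.hi


@dataclass(frozen=True)
class Congruent:
    """The integers equal to residue modulo a constant modulus."""
    residue: int
    modulus: int

    def contains(self, x: int) -> bool:
        return (x - self.residue) % self.modulus == 0


@dataclass(frozen=True)
class Union:
    left: object
    right: object

    def contains(self, x: int) -> bool:
        return self.left.contains(x) or self.right.contains(x)


@dataclass(frozen=True)
class Intersection:
    left: object
    right: object

    def contains(self, x: int) -> bool:
        return self.left.contains(x) and self.right.contains(x)


def members(s, lo: int, hi: int) -> list:
    """List the elements of s that fall inside the finite window lo..hi."""
    return [x for x in range(lo, hi + 1) if s.contains(x)]


# Usage: the odd indices in {1,...,10}.
odd_indices = Intersection(Subrange(1, 10), Congruent(1, 2))
```

A set such as {b × c : b ∈ S, c ∈ T} has no representation in these terms, which is the sense in which the multiplicative form is inadmissible.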


A.2.2  Inclusion  of  ARRAY  declarations 

It is necessary to decorate the program with ARRAY declarations to allow for a proper
synthesis of multiple processors¹. It is not necessary to declare all arrays; only those which
you would like the system to consider expanding into a multiprocessor configuration.

There are two methods for inserting the necessary ARRAY declarations. One is to edit the
ASCII form of the specification (the one in the file) and have it reread by the CHI reader. An
alternative is to insert the ARRAY declaration at the appropriate point of the specification.
First the program must be enclosed in a binding block if it is not already in one. Then go
to some point in the bound variable set and use the command (: (arraydeclaration)). The
array declaration will be read by the V reader and inserted at the appropriate point. It is
conventional to place all of the array declarations at the beginning of the program, but it
doesn’t matter as long as they are at the appropriate level of block structure, usually the
top. We prefer editing the source because doing this documents the session in a permanent
form.

¹In the future, this step will be automated, at least where the bounds of arrays can be determined at
“compile time” by data flow.

There are ordinary ARRAY declarations describing arrays of data that are computed
during the course of a problem, and there are I/O ARRAY declarations that describe
arrays of data that come from or go to the outside world. I/O arrays should always be
declared because there are important efficiency issues that will be ignored if they are not.
These issues are principally those of excessive interconnections being necessary between
the I/O channels and the outside world. It will never be assumed that an I/O array is
resident in more than one processor. For the matrix multiplication problem a proper array
declaration is:


bind
    A istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    B istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    result istype OUTBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    C istype ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
do
    ∀i, i ∈ {1,...,n}
        ∀j, j ∈ {1,...,n}
            C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
            result_ij ← C_ij


The inbound and outbound arrays are declared separately. There is yet another declaration
for the array C because otherwise the system would not be willing to allocate separate
processors to the elements of the answer. This separate array is called a shadow array.
Shadow arrays can be created automatically by a rule named MAKE-SHADOW-ARRAYS.
Simply type (AR MAKE-SHADOW-ARRAYS) if a shadow array for each I/O array is
desired. In this case (for simplicity of exposition) we choose not to do this; instead we add
a single shadow array by hand².

A.2.3  Inclusion  of  PROCESSORS  declarations 

After you include all of the ARRAY declarations for arrays that the system should
expand, you can have the system introduce tentative PROCESSORS declarations. An
alternative, to be discussed in a later Section, is to perform a virtualization, which must be
done before the PROCESSORS declarations are introduced. The rules that introduce
PROCESSORS declarations are called MAKE-IOPSs and MAKE-PSs.



The console session looks like this:


A istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
B istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
result istype OUTBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
C istype ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
∀i, i ∈ {1,...,n}
    ∀j, j ∈ {1,...,n}
        C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
        result_ij ← C_ij


. (AR MAKE-PSs)

A istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
B istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
result istype OUTBOUND ARRAY (ij)
C istype ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
        HAS C_ij
∀i, i ∈ {1,...,n}
    ∀j, j ∈ {1,...,n}
        C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
        result_ij ← C_ij

²To be strictly correct and to allow TRANSCONS to use all of the “intelligence” at its disposal, we should
probably have assigned separate shadow arrays to A_ij and B_ij, too. On a problem of this simplicity this
would be pedantic. There is a simple correspondence between the final result achieved by TRANSCONS
after all rule applications when MAKE-SHADOW-ARRAYS is used and the one we will obtain here, but
the former is visually much more complex.

. (AR MAKE-IOPSs)

A istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC002 istype PROCESSORS HAS A_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
B istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC003 istype PROCESSORS HAS B_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
result istype OUTBOUND ARRAY (ij)
    PROC004 istype PROCESSORS HAS result_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
C istype ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
        HAS C_ij
∀i, i ∈ {1,...,n}
    ∀j, j ∈ {1,...,n}
        C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
        result_ij ← C_ij


There are several things to note: User input (shown in roman) is not case-sensitive, so
any combination of upper and lower case can be used. Output is shown in other fonts.
At present TRANSCONS uses GENSYMs for processor family names rather than forming
elegant ones out of the name of the corresponding variable. Only the family PROC001
contains more than one processor. The USES and HEARS clauses have not been filled
in yet.



A.2.4  Adding  the  HEARS/USES  Clauses 


After this it is necessary to add the HEARS clauses and their USES subclauses. The
rule that does this is called MAKE-USES-HEARS.


The  dialog  continues... 


. (AR MAKE-USES-HEARS)

A istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC002 istype PROCESSORS HAS A_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
B istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC003 istype PROCESSORS HAS B_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
result istype OUTBOUND ARRAY (ij)
    PROC004 istype PROCESSORS HAS result_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
        HEARS PROC001_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
            (USES C_ij, i ∈ {1,...,n}, j ∈ {1,...,n})
C istype ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
        HAS C_ij
        HEARS PROC002 (USES A_ik, k ∈ {1,...,n})
        HEARS PROC003 (USES B_kj, k ∈ {1,...,n})
∀i, i ∈ {1,...,n}
    ∀j, j ∈ {1,...,n}
        C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
        result_ij ← C_ij


A.2.5  Clause  Reduction:  Simple  and  Complex 

The next step is reducing the awesome connections to PROC004 (and those to PROC002
and PROC003 which become evident when the SENDS clauses are added). Normally
REDUCE-HEARS can be used, but it won’t work here.




.  (AR  REDUCE-HEARS) 

Rule  failed  to  apply. 

If there were a clause eligible for such treatment, the operator would have used the CHI
structure editor to find the clause and add the reducible property to it. He would have
done this by using the editor to make the reducible clause the current node, applying the
rule MARK-HEARS-CLAUSE-AS-REDUCIBLE to the node, and then making the whole
program the current node again. This was not done here because there was no reducible
HEARS clause.


The solution to this lies in a rule that partitions the induced sets of instantiations of a
HEARS clause in such a manner that the clause can be replaced by a pair of clauses:
one clause to connect each partition to the outside world, and one that builds a chain
within the partition to distribute the data. The telescoping property [Kin82] is necessary
to validate the transformation; at present the operator has to give a little help.


. (PU I)

A istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC002 istype PROCESSORS HAS A_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
B istype INBOUND ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC003 istype PROCESSORS HAS B_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
result istype OUTBOUND ARRAY (ij)
    PROC004 istype PROCESSORS HAS result_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
        HEARS PROC001_ij, i ∈ {1,...,n}, j ∈ {1,...,n}
            (USES C_ij, i ∈ {1,...,n}, j ∈ {1,...,n})
C istype ARRAY (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
        HAS C_ij
        HEARS PROC002 (USES A_ik, k ∈ {1,...,n})
        HEARS PROC003 (USES B_kj, k ∈ {1,...,n})
∀i, i ∈ {1,...,n}
    ∀j, j ∈ {1,...,n}
        C_ij ← Σ_{k, k ∈ {1,...,n}} A_ik B_kj
        result_ij ← C_ij


. STEPS

( ... lists the parts of the program
)

. 8

PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    HAS C_ij
    HEARS PROC002 (USES A_ik, k ∈ {1,...,n})
    HEARS PROC003 (USES B_kj, k ∈ {1,...,n})

. HEARS-CLAUSES

(HEARS PROC002 (USES A_ik, k ∈ {1,...,n})
 HEARS PROC003 (USES B_kj, k ∈ {1,...,n}))

. 1

HEARS PROC002 (USES A_ik, k ∈ {1,...,n})

. (AR MARK-HEARS-CLAUSE-FOR-TELESCOPING)

Please give a V expression for the telescopes
(#> PROC002_{i,*})

Please give a V expression condition for a processor to be connected to the
outside world
(#> PROC002_{i,1})

Please give a pair of V expressions describing the connections
(#> PROC002_{i,j}) HEARS (#> PROC002_{i,j-1})

The marking process causes no change in the appearance of the program under a (PU I).
The new properties are invisible. They would show up under a (PU F) performed when
the current node was the marked HEARS clause, and of course they are visible to the
rules.

Now that properties have been attached to the chosen HEARS clause two rules can be
applied. One rule, MAKE-CHAIN, produces a new HEARS clause building the chains
of processors; the other, INPUT-ONLY-AT-BEGINNING-OF-CHAIN, takes advantage of
this to reduce the power of the original clause by giving it a condition that makes it
apply only to the processor at the beginning of a chain. This also moves the USES
clause, removes the now-spurious properties, and does a bit more housekeeping. There
is an analogous rule, OUTPUT-ONLY-AT-END-OF-CHAIN, that reduces the power of a
HEARS clause in an output device.

The  function  of  exploiting  telescopes  to  reduce  a  HEARS  clause  is  split  into  two  rules  for 
two  reasons:  the  second  rule  is  different  for  input  and  output,  and  it  will  not  always  be 
necessary  to  perform  both  operations.  If  a  chain  is  already  there  one  of  the  I/O-ONLY... 
rules  can  be  applied. 

The  dialog  continues. . . 


. 0

. 0

. 0          to get to the top level (computer’s printouts omitted for
             brevity.)

. (AR MAKE-CHAIN)

PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    HAS C_ij
    if j > 1 then
        HEARS PROC001_{i,j-1}
    HEARS PROC002 (USES A_ik, k ∈ {1,...,n})
    HEARS PROC003 (USES B_kj, k ∈ {1,...,n})

. (AR INPUT-ONLY-AT-BEGINNING-OF-CHAIN)

PROC001 istype PROCESSORS (ij), i ∈ {1,...,n}, j ∈ {1,...,n}
    HAS C_ij
    if j > 1 then
        HEARS PROC001_{i,j-1}
    if j > 1 ∧ j < n then
        LINKS PROC001_{i,j-1}, PROC001_{i,j+1}
            (PASSES A_ik, k ∈ {1,...,n})
    if j = 1 then
        HEARS PROC002 (USES A_ik, k ∈ {1,...,n})
    HEARS PROC003 (USES B_kj, k ∈ {1,...,n})

Voilà! The size of the induced set of HEARS PROC002 ... has been reduced from O(n)
to 1.
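
The effect of the two chain rules on the wiring can be checked with a small Python sketch. It is an illustration only: the function names and the "INPUT" label standing for PROC002 are assumptions, and the sketch models just the distribution of A along a row i, reading the reduction from O(n) to 1 as the number of processors in one row that hear PROC002 directly.

```python
def direct_hears(n):
    """Before the rules: every PROC001[i, j] hears the input processor directly."""
    return {(i, j): "INPUT" for i in range(1, n + 1) for j in range(1, n + 1)}


def chained_hears(n):
    """After the rules: only j == 1 hears the input; the others hear PROC001[i, j-1]."""
    return {
        (i, j): ("INPUT" if j == 1 else (i, j - 1))
        for i in range(1, n + 1)
        for j in range(1, n + 1)
    }


def direct_listeners_in_row(hears, row):
    """Number of processors in one row that are wired straight to the input."""
    return sum(1 for (i, _), src in hears.items() if i == row and src == "INPUT")


def hops_to_input(hears, proc):
    """Follow HEARS links back to the input; the count is the chain length used."""
    hops = 0
    while proc != "INPUT":
        proc = hears[proc]
        hops += 1
    return hops


# Usage with n = 4: row 2 has 4 direct listeners before, 1 after,
# and the last processor of the row still reaches the input in 4 hops.
before, after = direct_hears(4), chained_hears(4)
```

The trade is visible in the two helper functions: the fan-out from the input processor shrinks, while the path from the input to the far end of a row grows to length n.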


A.2.6  State of the Implementation

We divide the state of the implementation into three parts: that which is done, that which
is just engineering and likely to be done within a few months, and that which may require
more serious thought.

Already  Done 

We have integrated most of the “prototypes” (internal descriptions of processor structures)
and rules for lattice synthesis, except for the selection of individual processors’ programs,
into CHI and have performed test syntheses. The places where a theorem prover is
necessary have been replaced by a dialog with the user such as “Is ... a theorem?” and “What
must the expression A be to make P(A) a theorem?”.

Immediate  Next  Steps 

Next we integrate the prototypes for tree structures, and those rules for tree structures
that do not require backwards inference. A theorem prover, probably LMA, will be connected
to the (THEOREM ...) calls in the lattice synthesis section. There is one instance of backwards
inference in the lattice synthesis section, but this need can probably be met with an ad hoc
procedure.

We  will  implement  Mighty  Mouse  (see  below). 


And Finally, ...

We integrate backwards inference, which will be able to find a form that has a given
property. This is, of course, a difficult problem, but one in which some progress has been
made. Research is being done in this area because of its importance to divide & conquer. See,
for example, [Smi82].

In  addition  we  will  supply  rules  for  the  rephrasing  of  array  problems  as  closure  computation 
problems. 

These  steps  together  will  make  the  tree  synthesis  subsystem  complete. 

A.2.7  MightyMouse 

Exploring a fairly large program in CHI in general and TRANSCONS in particular can be
a laborious undertaking because of the need to move around the structure. To do this
the user has to know what property he wants to use for his descent. This is rather more
internal knowledge than we feel that a user should need; he should have a fair idea of what
text he wants to think about, but how the levels of the tree divide and what the names of
things are should not be part of this knowledge.

MightyMouse is entered by evaluating (mmouse) to edit the current node or (mmouse
(node)) to edit (node). When you do this, two panes will be flashed on the screen:

•  the Position Pane, which is used for noodling through the data. This pane shows
a prettyprinted version of the data. Parts of it are “mouse-sensitive”. This means
that whenever you position the mouse so that the pointer points to a character, the
computer indicates the text that commands will apply to. The indicated text is the smallest
block of text corresponding to one tree node that includes the character pointed to
by the mouse.


When you “click” on indicated text a tree node is chosen as follows: the left button
selects the highest tree node represented by the text, the middle button selects an
intermediate node, and the right button selects the lowest. If there are more than
three choices and the middle button is used, a menu “pops up” that invites the user
to select a choice.

Once a node is selected a menu pops up with several options: you can make the node
current, select a subnode, select a supernode, change the value of the slot that the
node sits in, or choose to apply an applicable rule. Rules can be applied in automatic
or semiautomatic mode. In automatic mode, the rule is fully applied, i.e., it is applied
at all places where it can be applied. In semiautomatic mode, every time a pattern
match is successful the matching node is highlighted and the user can either click one
button to apply the rule, another button to skip that application, or a third button
to exit completely.

•  the  History  Pane,  which  shows  the  last  sixteen  current  nodes.  You  can  make  any  one 
of  them  current.  The  history  pane  remembers  its  history  from  one  call  of  MMouse 
to  the  next,  so  several  unrelated  data  can  be  explored  together. 
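
The Position Pane's selection rule can be sketched in Python. This is not the MightyMouse implementation: SpanNode, candidates and pick are hypothetical names, character offsets stand in for screen positions, and the middle button simply returns the middle candidate rather than popping up a menu when there are more than three.

```python
from dataclasses import dataclass, field


@dataclass
class SpanNode:
    """Hypothetical tree node together with the text range [start, end) it prints as."""
    name: str
    start: int
    end: int
    children: list = field(default_factory=list)


def candidates(root, pos):
    """Nodes whose text is the smallest block containing pos, highest node first."""
    path, node = [], root
    while node is not None and node.start <= pos < node.end:
        path.append(node)
        node = next((c for c in node.children if c.start <= pos < c.end), None)
    if not path:
        return []
    lowest = path[-1]
    # Several nested nodes may print as exactly the same block of text.
    return [n for n in path if (n.start, n.end) == (lowest.start, lowest.end)]


def pick(cands, button):
    """Left selects the highest node, right the lowest, middle an intermediate one."""
    if button == "left":
        return cands[0]
    if button == "right":
        return cands[-1]
    return cands[len(cands) // 2]


# Usage: three nested nodes print as the same text (columns 5..11).
var = SpanNode("var", 5, 6)
term = SpanNode("term", 5, 12, [var])
expr = SpanNode("expr", 5, 12, [term])
stmt = SpanNode("stmt", 5, 12, [expr])
program = SpanNode("program", 0, 20, [stmt])
```

Pointing at column 8 indicates the block 5..11 and offers stmt, expr and term; pointing at column 5 indicates only var, since it is the smallest node containing that character.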

B  Correctness  Considerations 

Below we show the formal reasoning required to validate the primary rule that performs
HEARS clause reduction. The motivation for this rule is the fact that the
interconnections inferred from the data flow information can be unacceptably rich. We want to
reduce these interconnections, hence the name REDUCE-HEARS of the rule primarily
responsible for this change. We want neither to cut off a processor from information it
needs to compute its answer, nor to create such circuitous paths for data that the converted
architecture is significantly slower than the original. “Significantly slower” here will mean
“slower by more than a constant factor”.
The REDUCE-HEARS transformation establishes a pipeline from one processor
through a series of other processors instead of a wire from the first processor to each
of the other processors. We intend to show that the use of this transformation renders
the specification neither incorrect nor less efficient (up to a constant factor). We will use
two separate theorems.

The first theorem claims that if a specification is correct, here meaning that all of the data
it needs to do its work is available at some time, then the specification resulting from an
application of REDUCE-HEARS will also be correct in this sense. The rule does not
cause other changes to the specification except for the replacement of HEARS clauses
with smaller HEARS clauses in combination with PASSES clauses.

The first lemma to the second theorem makes a much broader claim than is necessary
merely to limit this rule to a constant factor slowdown. It states that whenever there is
a collection of processors such that the longest path length is l(n), the largest number of
values computed in one processor is v(n), the highest in-degree of any processor is i(n) and
the largest amount of calculation per input value is c(n), the connections form a DAG, and
a couple of other reasonable conditions are met, then the runtime of the parallel structure
is at most O(i(n) v(n) l(n) c(n)). The second theorem will then claim that this establishes
a speed-preserving property for REDUCE-HEARS.
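
Two of the quantities in this bound, the longest path length l(n) and the highest in-degree i(n), are properties of the connection graph alone. The Python sketch below computes them for a small DAG; the function names, the dictionary representation (each processor mapped to the processors it HEARS) and the convention of counting path length in edges are assumptions made for illustration.

```python
from functools import lru_cache


def dag_metrics(hears):
    """Return (longest path length, highest in-degree) of a HEARS graph.

    hears maps each processor to the list of processors it HEARS,
    that is, to its predecessors in the DAG.
    """
    @lru_cache(maxsize=None)
    def depth(p):
        preds = hears.get(p, [])
        return 0 if not preds else 1 + max(depth(q) for q in preds)

    longest = max((depth(p) for p in hears), default=0)
    in_degree = max((len(preds) for preds in hears.values()), default=0)
    return longest, in_degree


def runtime_bound(i, v, l, c):
    """The product i * v * l * c from the stated bound, ignoring constant factors."""
    return i * v * l * c


# Usage: one input feeding four processors, wired as a star and as a chain.
star = {"IN": [], 1: ["IN"], 2: ["IN"], 3: ["IN"], 4: ["IN"]}
chain = {"IN": [], 1: ["IN"], 2: [1], 3: [2], 4: [3]}
```

For the star the metrics are (1, 1) and for the chain (4, 1): the chain lengthens the longest path while leaving the in-degree unchanged.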


rule REDUCE-HEARS (stmt) TRANSFORM

stmt: 'PNAME istype PROCESSORS (%PDV) %PENUMER ...
    if COND1 then
        HEARS PNAME_{%HBV} %HENUMER
            (USES UV_{%UBV} %UEN, ...)

  ∧ (THEOREM
      (IS1 = {HBV : HENUMER[PDV \ PDV_1]}                          (1)
     ∧ IS1a = {HBV : HENUMER[PDV \ PDV_2]}                         (2)
     ∧ IS2 = {PDV : HENUMER ∧ HBV = PDV}                           (3)
     ∧ PROC1 = {PDV_1}                                             (4)
     ∧ PROC2 = {PDV}                                               (5)
     ∧ PROCh = {HEXPR}                                             (6)
     ∧ PROChi = {HIEXPR}                                           (7)
     ∧ (IS1 ∩ IS1a) ∈ {∅, IS1, IS1a}                               (8)
     ∧ ((∅ ⊂ IS1 ⊂ IS1a ∧ COND1)                                   (9)
          ⇒ IS1 ∪ PROC1 = IS2)                                     (10)
     ∧ (COND2 ⇒ COND1 ∧ IS1 ∪ PROCh = IS2)                         (11)
     ∧ (¬COND2 ⇒ ¬∃PDV_3 [IS1 ⊂ {HBV : HENUMER[PDV \ PDV_3]}])     (12)
     ∧ (COND3 ⇒ COND2 ∧ COND1[PDV \ HIEXPR])                       (13)
     ∧ (HIEXPR[PDV \ HEXPR] = PDV)))                               (14)

→

stmt: 'PNAME istype PROCESSORS (%PDV) %PENUMER ...
    if COND3
        LINKS PNAME_{HEXPR}, PNAME_{HIEXPR}
            (PASSES UV_{%UBV} %UEN, ...)
    if COND2
        HEARS PNAME_{HEXPR}


The intent of this rule, especially that of the call of the theorem prover that makes up its
bulk, is not obvious. We will explain it before proving theorems about it.


As in all TRANSCONS transformation rules, free variables on the left hand side of the
rule (in this case, everything above the arrow) are implicitly existentially quantified. The
objects IS1 through PROChi receive setformer-valued expressions, some of which (i.e.,
PROCh) form singleton sets. COND2 and COND3 are instantiated to boolean predicates
with free variables chosen from PDV by a similar process. “Backward inference”, or the
determination of the form of an expression from assertions about its value, must be used
in four cases: to get values for HEXPR, HIEXPR, COND2 and COND3.


The setformer expressions of the theorem above give us sets of vectors of values, whose
dimension is the same as PDV, which is the vector of bound variables in the original
PROCESSORS declaration. The subscripts on instances of PDV in the setformer produce
distinguished names, so (for example) if the first element of PDV is i then the first element
of PDV_1 is i_1, a different object that need not unify to the same thing i unifies to.


Theorem B.1  Suppose REDUCE-HEARS applies to a given HEARS clause. Then
every processor of the specification resulting from the application will have all data available
to it that the corresponding processor had in the original specification.

Proof: REDUCE-HEARS applies to HEARS clause H_0. The THEOREM conjunct
contains a large conjunction that implies several things. By lines 1, 2 and 8 we have
that two induced sets of H_0 (called IS1 and IS1a) are either disjoint or telescoping (one
contains the other). By 1, 2, 3, 9 and 10 we have that if two induced sets telescope and
one is strictly smaller than the other there is a singleton set (called PROC1) which we can
add to the smaller one to make a different induced set (called IS2). Line 4 gives the name
PDV_1 to the indices of the processor comprising the singleton set. Line 6 asserts that
there is an index expression HEXPR that generates the singleton set PROCh and that
has PDV for free variables, such that PROCh indexes the processor that must be added
to IS1 to get IS2, and that this is possible whenever COND2 is true. Line 12 asserts
that COND2 is true whenever such a processor can be found. Lines 7 and 14 assert that
the HEXPR mapping has an expression that is its inverse, called HIEXPR, and line 14
asserts that a given processor is mapped into by the HEXPR relationship for some other
processor. It must be shown that wherever the original specification had a HEARS A
clause that allowed data to flow from point A to point B and satisfy a USES clause, the
new specification will have a HEARS C clause, and either C = A or there is an unbroken
chain of processors that PASS the data from A to B. We will do this by induction on the
size of the induced set of H_0.

If B HEARS one processor then clearly the only way to satisfy lines 9-11 is with that
processor as PROCh (and IS1 = ∅). If B HEARS more than one, then it hears IS2 and
by 9-11 there is a processor PROCh whose induced set is IS1 such that IS1 ∪ PROCh =
IS2, so PROCh's induced set is smaller than IS2. PROCh therefore has access to the
information in IS1, by the induction hypothesis.


We further know that PROCh needs information available only from IS1 and from all
of IS1. The PASSES clause of the third line from the bottom of the rule assures that
information available to PROCh will be available to PROC2 as well. This completes the
induction and the proof. ∎
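
The disjoint-or-telescoping condition on induced sets that the proof relies on can be stated as a small Python check. The function name is_telescoping_family and the representation of induced sets as Python sets of processor indices are assumptions made for illustration; the condition tested is that the intersection of any two induced sets is empty or equal to one of them.

```python
def is_telescoping_family(induced_sets):
    """True if every pair of induced sets is either disjoint or nested."""
    sets = [frozenset(s) for s in induced_sets]
    for idx, a in enumerate(sets):
        for b in sets[idx + 1:]:
            common = a & b
            # The intersection must be empty, or be one of the two sets itself.
            if common and common != a and common != b:
                return False
    return True


# Usage: a chain of nested sets plus one disjoint set is admissible;
# two sets that overlap only partially are not.
nested = [{1}, {1, 2}, {1, 2, 3}, {7, 8}]
overlapping = [{1, 2}, {2, 3}]
```

In the nested family each set grows from the previous one by a single element, which corresponds to the singleton processor that the proof adds at each induction step.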

Now we show that there is not too much of a slowdown, assuming certain reasonable
restrictions on the computation performed in the processors.

First  we  need  a  definition: 

Definition B.1 Each processor performs a computation. The form of the computation can
be represented as a tree. Call the highest node(s) with execution time of $O(1)$ outermost
fast nodes. Each outermost fast node is either the highest node of the computation or is
used an asymptotically nonconstant number of times. Call the sets of values used by the
several uses of outermost fast nodes fast sets.

It is clear that the size of a fast set must be $O(1)$.

Now we can show that the asymptotic performance of the parallel structure that results
from REDUCE-HEARS is equal (up to a constant factor) to that of the unreduced parallel
structure.

Lemma B.2 Suppose there is a collection of processors such that the longest path length is
$l(n)$, the largest number of values computed in one processor is $v(n)$, the highest in-degree
of any processor is $i(n)$, the connections form a DAG, and the fast sets are disjoint and
contain at most one datum from each input path. If use (by the enclosing node) of the
value generated by a fast node takes $O(1)$ time, and if a processor that receives at most $v$
values on any of its input lines sends at most $v$ copies of values that it has received on its
output lines, then all processors finish their jobs within $O(i(n)v^2(n)l(n))$ units of time. We
assume that during each time unit a processor can receive one value from each input, send
one copy of a previously received value on each output, and do some computation.


Proof: By induction on the length of the longest path ending in a processor. Let $F(P)$
be the length of this path for processor $P$ (the length of a path is the number of nodes
on the path, so $F(q) = 1$ if $q$ has no inputs). The induction hypotheses are: (H1) that
$P$ completes its work in $O(i(n)v^2(n)F(P))$ time, and (H2) that by time $F(P) + O(j)$ it has
received $j$ values on each of the input lines that has that many values to send. Processor
$P$ receives at most $v(n)F(P)$ input values on each input line, and per (H2) it will take at
most $F(P) + c_2 rk$ time units to receive $rk$ values from an input line that is due to send $k$.

Suppose r = < s < 1. After $F(P) + c_1 rk$ it will have received $rk$ values, which is

all but jjll^k, and by one time unit later it will have retransmitted them all. This validates
(H2) for processors in $\{Q : F(Q) < F(P) + 1\}$.

We have a second induction on the amount of time that has passed. Say the constant of
(H1) is $c_1$. We must have $c_1 > c_2$, or else processor $P$ would be able to complete its work
before it received all of its input. If $P$ must complete and use $m$ fast node computations
then (H3) is that by the time $c_1 v^2(n)F(P)$^^^|p it will have been able to complete
$\lceil sm \rceil$ of them. The base case $s = 0$ requires nothing, and at the time
$c_1 v^2(n)F(P)$i^|^ there will be at most $\lceil (1 - s)m \rceil$ fast sets for which all
values have not been received. If (H3) is true for a given $s$ then it must be true for • +
the information will be available by (H2) and it will be possible to use it sufficiently fast
for some $c_2$ by (H3).

(H1) is immediate from (H3), and the lemma is immediate from (H1). ∎


Theorem B.3 The asymptotic speed of the result of a reduction is equal to that of the
unreduced parallel structure.




Proof: The reduced parallel structure finishes in $O(i(n)v^2(n)l(n))$ units of time. For any
single reduced structure $i(n)$ and $v(n)$ will be constant functions, so it must only be shown
that the performance of the unreduced structure is no better than $O(l(n))$. But by the
snowballing property a chain of length $l(n)$ can only have arisen if there was a processor in
the unreduced structure that receives $l(n)$ values, which it would certainly require $O(l(n))$
time to process. ∎

C  Quantifier  Levelling  Proofs 


We used three identities on quantifiers during the quantifier levelling of Chapter 6. The
need for these identities arises from the fact that we have a predicate with a quantifier whose
bound variable is bounded on both ends and which occurs in a context in which a series of
values obtained by varying both bounds is desired. The computation is expensive because
a two dimensional array of values is needed, and nothing analogous to a parallel prefix
operation is directly available.

The values that are found in this two dimensional array are by no means independent,
and with some manipulation the array can be "squashed" into a pair of vectors. What
we accomplish by the identities that we exploit in Section 6.2 and demonstrate here is
this squashing of the array by summarizing intervals over which predicates with bounded
quantifiers are true as pairs of integers.
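The squashing can be illustrated with a small sketch. It is not taken from the thesis: it assumes integer indices 0..n-1, reads the doubly bounded quantifier as ranging over l < x <= u, and uses -1 in place of minus infinity. The function names are hypothetical. One linear scan per vector replaces the quadratic table of quantified values.

```python
def last_false_vector(p):
    """last_false[u] = max{x <= u : not p[x]}, or -1 if no such x exists."""
    out, last = [], -1
    for x, holds in enumerate(p):
        if not holds:
            last = x
        out.append(last)
    return out


def last_true_vector(p):
    """last_true[u] = max{x <= u : p[x]}, or -1 if no such x exists."""
    return last_false_vector([not v for v in p])


def forall_table(p):
    """Naive 2-D table: entry [l][u] is True iff p[x] holds for all l < x <= u."""
    n = len(p)
    return [[all(p[x] for x in range(l + 1, u + 1)) for u in range(n)]
            for l in range(n)]


p = [True, False, True, True, False, True]
vec_false = last_false_vector(p)   # summarizes every "for all" entry
vec_true = last_true_vector(p)     # summarizes every "there exists" entry
table = forall_table(p)

n = len(p)
# Each table entry is recovered from a single vector element and one comparison.
forall_ok = all(table[l][u] == (vec_false[u] <= l)
                for l in range(n) for u in range(n))
exists_ok = all(any(p[x] for x in range(l + 1, u + 1)) == (vec_true[u] > l)
                for l in range(n) for u in range(n))
```

Under these assumptions, each entry of the two dimensional array costs one comparison against a vector element, which is the sense in which the array is summarized by a pair of vectors.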

We display the identities below and then give the proofs.

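$$(\forall\, l < x \le u)[P(x)] \equiv \max_{x \le u}[\lnot P(x)] \le l \qquad (\forall\text{-to-max})$$

$$(\exists x \le u)[P(x) \land F(u) < x] \equiv (\exists\, F(u) < x \le u)[P(x)] \qquad (\text{constraint-to-binder})$$

$$(\exists\, l < x \le u)[P(x)] \equiv \max_{x \le u}[P(x)] > l \qquad (\exists\text{-to-max})$$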


Theorem C.1

$$(\forall\, l < x \le u)[P(x)] \equiv \max_{x \le u}[\lnot P(x)] \le l \qquad (\forall\text{-to-max})$$

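Proof: By convention, $\max_{x \le u}[\text{false}] = -\infty$ where $(\forall t)[-\infty < t]$. We have

$$\max_{x \le u}[P(x)] = y \equiv P(y) \land (\forall\, y < x \le u)[\lnot P(x)] \tag{1}$$

and

$$\max_{x \le u}[P(x)] \le y \equiv (\forall x \le u)[\lnot P(x)] \lor (\exists w \le y)[P(w) \land (\forall\, w < x \le u)[\lnot P(x)]] \tag{2}$$

These are from the definition of $\max_{x \le u}[P(x)]$ as that $z$ such that $P(z)$ is indeed true and
that $P(x)$ is false for any $z < x \le u$.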

We also have

$$w \le y \le u \Rightarrow \bigl((\forall\, w < x \le u)[\lnot P(x)] \equiv ((\forall\, w < x \le y)[\lnot P(x)] \land (\forall\, y < x \le u)[\lnot P(x)])\bigr) \tag{3a}$$

and

$$y \le u \Rightarrow \bigl((\forall x \le u)[\lnot P(x)] \equiv ((\forall w \le y)[\lnot P(w)] \land (\forall\, y < w \le u)[\lnot P(w)])\bigr) \tag{3b}$$

as these forms merely split a quantified predicate, which is a statement about a range of
integers, into a conjunction of statements about portions of that range.

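So (2) becomes:

$$\max_{x \le u}[P(x)] \le y \equiv \bigl((\forall x \le y)[\lnot P(x)] \land (\forall\, y < w \le u)[\lnot P(w)]\bigr) \lor (\exists w \le y)[P(w) \land (\forall\, w < x \le y)[\lnot P(x)] \land (\forall\, y < x \le u)[\lnot P(x)]] \tag{4}$$

Factoring (4) (and observing that the inner quantifier of (4) is independent of the outer
quantifier's bound variables), we get

$$\max_{x \le u}[P(x)] \le y \equiv \bigl((\forall x \le y)[\lnot P(x)] \lor (\exists w \le y)[P(w) \land (\forall\, w < x \le y)[\lnot P(x)]]\bigr) \land (\forall\, y < w \le u)[\lnot P(w)] \tag{5}$$

but the first conjunct of (5) is true by the definition of $\forall$ in terms of $\exists$, the law of the
excluded middle, and the fact that any nonempty subset of a finite set of integers has a
maximal element. So we have $\max_{x \le u}[P(x)] \le y \equiv \text{true} \land (\forall\, y < w \le u)[\lnot P(w)]$.
Substituting $\lnot P$ for $P$ and $l$ for $y$ yields the identity. ∎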


The next identity simply restates a singly boundedly quantified predicate which includes a
restriction on the bound variable of the quantifier as a doubly bounded quantified predicate
without the restriction. This is obvious from the definitions.


Theorem C.2

$$(\exists x \le u)[P(x) \land F(u) < x] \equiv (\exists\, F(u) < x \le u)[P(x)] \qquad (\text{constraint-to-binder})$$

Proof: Immediate from the definition $(\exists\, l < x \le u)[P(x)] \equiv (\exists x)[P(x) \land l < x \land x \le u]$. ∎

Theorem C.3

$$(\exists\, l < x \le u)[P(x)] \equiv \max_{x \le u}[P(x)] > l \qquad (\exists\text{-to-max})$$

Proof: This is the dual of ($\forall$-to-max). ∎
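The three identities can be checked exhaustively on small ranges. The sketch below is an illustration rather than part of the thesis: it assumes nonnegative integer indices, reads the bounded quantifiers as ranging over l < x <= u, and treats the value F(u) in constraint-to-binder as an arbitrary lower bound l. The helper names are hypothetical.

```python
from itertools import product

NEG_INF = float("-inf")


def max_sat(p, u):
    """Largest x <= u with p[x] true; minus infinity by convention if none."""
    return max((x for x in range(u + 1) if p[x]), default=NEG_INF)


def check_identities(n):
    """Test all three identities for every predicate on 0..n-1 and all bounds."""
    for bits in product([False, True], repeat=n):
        negated = [not b for b in bits]
        for u in range(n):
            for l in range(-1, n):
                window = range(l + 1, u + 1)  # the indices x with l < x <= u
                # forall-to-max
                if all(bits[x] for x in window) != (max_sat(negated, u) <= l):
                    return False
                # exists-to-max
                if any(bits[x] for x in window) != (max_sat(bits, u) > l):
                    return False
                # constraint-to-binder, with l standing in for F(u)
                constrained = any(bits[x] and l < x for x in range(u + 1))
                if constrained != any(bits[x] for x in window):
                    return False
    return True


identities_hold = check_identities(5)
```

A finite check of this kind does not replace the proofs, but it is a cheap guard against an off-by-one error in the choice of strict versus non-strict bounds.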

D  Theorem  Reduction  Forms 

Below are the rules used to transform TRANSCONS's (THEOREM ...) forms into
Presburger Arithmetic with restricted quantifier depth:


possible-theorem := sboolean | sboolean [∧ / ∨ / ⇒ etc. sboolean] | ~ sboolean

sboolean := set = set | bset ∈ explicit-set-of-sets | set ⊇ set

            | |bset1| + constant = |bset2| ∧ bset2 ⊇ bset1

set := set [∪ / ∩ / − set] | bset

bset := (ordinary setformers)

The sufficiency of these constructs can be argued from the following observations:

•  Telescoping and Snowballing can be expressed in this language, provided the
underlying sets are expressible with legal setformers. (This proviso will not be restated in
what follows.)

•  The answers to value-flow questions (e.g. REACHES) can be phrased in this form.
A  node  reachable  from  two  places  will  generate  a  Ui  ^  IF  may  generate  an  f)  or  ^ 
A  or  a  and  nested  loops  generate  setformers  with  multiple  bindings. 

•  A question of whether a processor from a given set is connected to a processor from
another given set can be asked in this form if the two sets of processors are expressible
by ITCONST rules.

We  have  a  form,  (THEOREM  . . .),  where  the  argument  is  a  V  expression  conforming  to 
the  above  syntax.  Rules,  to  be  displayed  below,  are  used  to  reduce  these  expressions  to 
longer  but  simpler  ones  from  P.  A.  This  language  is  adequate  for  the  REDUCE-HEARS 
rule,  which  reduces  the  communication  paths  of  an  amenable  network  to  a  smaller  one. 
We  claim  it  will  prove  to  be  adequate  for  future  needs. 

The form of a setformer is {⟨expr⟩ : (⟨bvlist⟩) ⟨exprlist⟩ | ⟨predicate⟩}.

⟨expr⟩ is an expression linear in all variables of (⟨bvlist⟩). (⟨bvlist⟩) is a list of variables.
A (lexical) binding scope is created for each variable that includes the whole setformer
(and no more). The ⟨exprlist⟩ must be a list of expressions. For simplicity we require
that each be of the form bv ∈ set where bv is from the ⟨bvlist⟩ and set is itself either an
admissible setformer, an integer subrange, or the finite union/intersection of the above.
“| ⟨predicate⟩” is optional, but where present it is a boolean expression. (When not present,
“true” is used.)

The (THEOREM ...) function receives a general V expression and applies the
transformations given below until no change occurs. It passes the result, a P.A. expression,
to the real theorem prover.
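The "apply until no change" strategy can be sketched as a small fixpoint rewriter. This is an illustration only, not the TRANSCONS implementation: the nested-tuple term representation, the tag names (`"in"`, `"union"`, `"inter"`, `"range"`, `"ge"`, `"le"`), and the function names are all assumptions, and only three of the rules (union, intersection, and integer subrange membership) are covered.

```python
def rewrite_once(t):
    """Apply one rule at the root if it matches, otherwise rewrite the children."""
    if not isinstance(t, tuple):
        return t
    if t[0] == "in" and isinstance(t[2], tuple):
        a, s = t[1], t[2]
        if s[0] == "union":   # a in x U y  ->  a in x  or  a in y
            return ("or", ("in", a, s[1]), ("in", a, s[2]))
        if s[0] == "inter":   # a in x n y  ->  a in x  and  a in y
            return ("and", ("in", a, s[1]), ("in", a, s[2]))
        if s[0] == "range":   # a in {low ... high}  ->  a >= low and a <= high
            return ("and", ("ge", a, s[1]), ("le", a, s[2]))
    return (t[0],) + tuple(rewrite_once(c) for c in t[1:])


def theorem(t):
    """Rewrite repeatedly until a pass leaves the term unchanged."""
    while True:
        new = rewrite_once(t)
        if new == t:
            return t
        t = new


form = ("in", "a", ("union", ("range", 1, 5), ("inter", "x", "y")))
reduced = theorem(form)
```

Each rule in this sketch removes one set operator without creating another, so the loop terminates; membership tests against plain set names such as `"x"` are left for the later stage.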

Below  we  list  the  rules  for  converting  forms  from  the  (THEOREM  . . .)  language.  The 
transformations  below  all  preserve  semantics.  If  we  group  the  allowable  connectives  as 
follows: |set1| + ⟨const⟩ = |set2| . ..,U ^ we can easily observe that the
transformations below always strictly reduce the number of occurrences of the highest ranked
connective  that  they  touch  at  all.  Since  an  instance  of  the  highest  ranked  non-P.  A.  form 
in  an  expression  is  always  reducible,  the  process  terminates  with  no  such  forms  left. 

Each  of  the  rules  given  below  performs  some  of  the  reduction  by  decreasing  the  number 
of  places  it  applies  without  increasing  the  number  of  places  that  rules  appearing  earlier  in 
this  list  apply. 

rule SPECIFIC-SIZE-DIFFERENCE-SUPERSET (s) TRANSFORM
s : ‘|set1| + ⟨constant⟩ = |set2| ∧ set2 ⊇ set1’

→

a  :  '3Xi,Xi,. . . <  i, j  <  (conat) 

1  Xi  Xj  A Vy  €  art2(y  €  aetl  vr  =  AiVy  =  X,v...vy  = 

rule TEST-IF-ANY-EQUAL (s) TRANSFORM
s : ‘x ∈ {y1, y2, ..., yn}’


rule SETFORMER-INCLUSION-TO-FUNCTION-TEST (s) TRANSFORM

s : ‘{F1(b1) : b1 ∈ s1 | T1(b1)} ⊇ {F2(b2) : b2 ∈ s2 | T2(b2)}’

→

s : ‘∀b2[b2 ∈ s2 ∧ T2(b2) ⇒ ∃b1[b1 ∈ s1 ∧ T1(b1) ∧ F1(b1) = F2(b2)]]’

rule SET-=-TO-2-WAY-INCLUSION (s) TRANSFORM
s : ‘set1 = set2’

→

s : ‘set1 ⊇ set2 ∧ set2 ⊇ set1’


rule EMPTY-SET-FROM-SETFORMER (s) TRANSFORM

s : ‘{F(b) : (b) b ∈ s | T(b)} = ∅’

→

s : ‘∀b[b ~∈ s ∨ ~T(b)]’

rule UNION-TO-OR (s) TRANSFORM
s : ‘a ∈ x ∪ y’

→

s : ‘a ∈ x ∨ a ∈ y’

rule INTERSECTION-TO-AND (s) TRANSFORM

s : ‘a ∈ x ∩ y’

→

s : ‘a ∈ x ∧ a ∈ y’

rule INTEGER-SUBRANGE-MEMBERSHIP-TO-INEQUALITIES (s) TRANSFORM
s : ‘a ∈ {low ... high}’

→

s : ‘a ≥ low ∧ a ≤ high’


rule MEMBER-SETFORMER (s) TRANSFORM

s : ‘a ∈ {F(b) : b ∈ s | T(b)}’

→

s : ‘∃b[b ∈ s ∧ T(b) ∧ a = F(b)]’
rule TWO-LAYER-MEMBER-SETFORMER (s) TRANSFORM
s : ‘a ∈ {F(b, c) : b ∈ s, c ∈ t(b) | P(b, c)}’

→

s : ‘∃b[b ∈ s ∧ ∃c[c ∈ t(b) ∧ P(b, c) ∧ a = F(b, c)]]’
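That the two setformer membership rules preserve semantics can be spot-checked on finite examples. The sketch below is illustrative only: the sets, the functions `F`, `F2`, `T`, `P`, `t`, and the helper names are hypothetical choices, with Python set comprehensions standing in for setformers and `any` standing in for the existential quantifier.

```python
def member_setformer(a, F, s, T):
    """Right-hand side of MEMBER-SETFORMER: exists b [b in s and T(b) and a = F(b)]."""
    return any(T(b) and a == F(b) for b in s)


def member_two_layer(a, F, s, t, P):
    """Right-hand side of the two-layer rule, with c drawn from a set t(b) depending on b."""
    return any(any(P(b, c) and a == F(b, c) for c in t(b)) for b in s)


s = range(0, 10)

# One-layer example: {3b + 1 : b in s | b even}
F = lambda b: 3 * b + 1
T = lambda b: b % 2 == 0
lhs_set = {F(b) for b in s if T(b)}
one_layer_ok = all((a in lhs_set) == member_setformer(a, F, s, T)
                   for a in range(-5, 40))

# Two-layer example: {b + 2c : b in s, c in t(b) | c != 4}
t = lambda b: range(b, b + 3)
F2 = lambda b, c: b + 2 * c
P = lambda b, c: c != 4
lhs_set2 = {F2(b, c) for b in s for c in t(b) if P(b, c)}
two_layer_ok = all((a in lhs_set2) == member_two_layer(a, F2, s, t, P)
                   for a in range(-5, 60))
```

In both cases the left-hand side builds the set explicitly while the right-hand side only searches for a witness, which is the form a quantifier-based prover can work with.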